Generating a word cloud (or not) from a Twitter hashtag

Word cloud showing most common questions under #askgove

Sample #askgove word cloud created from around 2,500 tweets

Education asked last Tuesday if we could create a word cloud on Friday from the questions asked on Twitter using the #askgove hashtag. One of those jobs that seems simple on the surface but isn’t!

  • Problem one – by Tuesday there were already thousands of tweets, and Twitter will only allow you to search so far back on a keyword.
  • Problem two – they wanted the cloud generated on Friday (when they go to print) so they could include as many #askgove questions as possible, which meant checking for new tweets every couple of hours during the week to compile an immense list.
  • Problem three – because there were so many tweets, it was impossible to go through and weed out all the extraneous words like reply, retweet, favorite, open, askgove before generating the cloud, to say nothing of all the stop words (and, a, the…). They wanted a cloud that highlighted the key questions being asked, so no words relating to usernames, no why/will/what/when… and sadly no swearing!
  • Problem four – I don’t work on Fridays.

I got as far as I could with it – I searched for #askgove on Twitter and pasted the available list of tweets so far into a program called word counter, to generate a list of words ranked by frequency. That weeded out some of the basic stop words. But how to turn that into a Wordle? I could see the most popular terms, but they only occur once in the text generated by the counter so the word cloud would be meaningless.

Step forward production, specifically a systems editor, who showed me a nifty bit of code which takes the word counter list and returns each word, repeated as many times as the frequency number next to it. Weed out the words we don’t want (check the ones we’re not sure about – ebacc, ict, hei – on Twitter), paste this into Wordle and voila! a word cloud.
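
The systems editor's code isn't reproduced here, but the idea can be sketched in a few lines of Python. This is only an illustration, under a few assumptions: that the word counter output is one "word count" pair per line, that the function name expand_counts and the UNWANTED set are my own inventions, and that the stop list is just the handful of words mentioned above rather than the full list we used.

```python
# Hypothetical exclusion list, built from the words mentioned in this post.
UNWANTED = {"reply", "retweet", "favorite", "open", "askgove",
            "why", "will", "what", "when", "and", "a", "the"}


def expand_counts(counter_output, unwanted=UNWANTED):
    """Turn 'word count' lines into text where each word appears count times."""
    words = []
    for line in counter_output.splitlines():
        parts = line.split()
        # Assumed format: exactly two fields, a word and a whole number.
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip blank or malformed lines
        word, count = parts[0].lower(), int(parts[1])
        if word in unwanted or word.startswith("@"):
            continue  # drop unwanted words and usernames
        words.extend([word] * count)
    return " ".join(words)


# Made-up sample data, not real #askgove counts.
sample = """schools 3
askgove 40
ebacc 2
the 25
@someone 4
"""
print(expand_counts(sample))
```

The output is a single block of text in which each remaining word is repeated according to its count, which is what a frequency-based tool like Wordle needs in order to size the words.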

I showed the process to the art director who works on Education, and mocked up a word cloud using the layout and colours she chose, to see whether it worked on the page.

I wrote detailed instructions for colleagues, and at their request I talked them through the process at my screen, so they could create the cloud without too many difficulties. They started to add to the list of tweets at the end of Wednesday (while I was still in, to check they’d got the process right).

And then…

…the word cloud was dropped from the supplement. This happens fairly often in journalism – a story is superseded by breaking news, the space is needed for advertising or a better alternative presents itself. The reason in this case was space – the word cloud simply didn’t work in the space available on the page. And they let us know early on Thursday, so my colleagues didn’t spend too long on it (sometimes we don’t get told at all).

So was it a waste of time? No. I learnt some valuable lessons, about how to generate word clouds but also about working with different departments (and colleagues) to create something for the paper.

Reflections

  • If something seems impossible at first glance, don’t just dismiss it. There’s usually a solution, though sometimes you have to put a bit of work in.
  • Ask for help if you don’t know how to do something – in such a big organisation there will usually be someone in the building who has the knowhow.
  • Collaboration is key – Education came to us at the beginning with a clear idea of what they wanted but little knowledge of how it could be done; I took it as far as possible then consulted someone with the technical knowledge; and we collaborated on the design so the editors could make a final decision. Sharing knowledge led to a better end result, even though it wasn’t used.
  • Now I know how to create a word cloud from any volume of text, so if it comes up again it’ll be easy (she says…).
  • Walking colleagues through a complicated process is better than just emailing a list of instructions, which can be confusing (some people learn better with visual aids) and can seem a little superior (not everyone responds well to being told what to do remotely).

I think that last one is the lesson I should really take to heart!

#libday8: Under starter’s orders

Library Day in the Life round 8 starts today – I’ll be posting at the end of each day and tweeting throughout (@katy_bird). Looking forward to reading everyone else’s exploits!

Working week, 16-18 January 2012

  • Working on plans for Olympics coverage: We’ve been chatting this week about how we can cover the London 2012 Olympics from an archive perspective. We’ll be blogging some archive stuff, and tweeting too, hopefully with coverage from previous London Olympics. We’re trying to initiate our own projects, rather than being approached by others all the time – it’s much better to be involved from the start so we can be realistic about what is achievable (learning from past mistakes!).
  • Wikipedia blackout: We didn’t see a massive influx of queries on Wednesday, when Wikipedia was blacked out for 24 hours to protest Sopa. Optimists would say that’s because our journalists are above using Wikipedia, but it’s more likely that they’d figured out ways around the blackout. Our encyclopaedias made a star turn for Guardipedia, when Patrick Kingsley fielded questions from readers stumped by the blackout. Shame there was no mention of the librarians (and lots of library clichés!), but he did give us a shout out on Twitter.
  • Journalist queries included a 1996 article on the Olympics, Syria in numbers, recent social stories on China, examples for a panel on home experiments gone wrong, interviews and reviews for Russell Tovey and Jaime Winstone, net % change of GDP over time, MP quotes on the Work Programme and a land registry search.

 

Changes to From the archive

Last week the Guardian underwent a modest restyling, with several pages stripped back. As a consequence, our From the archive column will no longer appear in the print version of the paper (except on Saturdays), but we will still be posting it online.

While there’s more cachet to having a column in the paper, there are advantages to working web-only.

  • The word-limit isn’t as restrictive, so we won’t need to edit a good piece down to 480 words, or tack on an unrelated article if it’s too short (although we don’t want to start posting 1,500-word essays either).
  • We can play around with the format, using strong graphics or images if we find them instead of text.

There’s some extra work involved in uploading articles straight to the web though.

  • The pieces don’t run past a sub-editor, so we need to pay more attention to the copy, comparing it with the original article to make sure there are no missing words or stray commas.
  • Sometimes we’ll have to write our own headlines, where the original doesn’t have one or has a poor one (19th century articles tend to be wordy).

I worked on the first batch of web-only articles last week and found a few errors in the texts. We’ve changed the rota to take that into account, so one person preps the article for uploading and someone else subs it, so hopefully we’ll catch most of the mistakes before we launch! I’ll be paying more attention to it when I find articles from now on, too.

Working week, 9-11 January 2012

  • Film Datablog post: Nominations were announced for several awards last week (WGAs and Bafta longlist), but Film were busy again and didn’t have time to update the spreadsheet. I need to check with them a few days in advance next time, to make sure they can cover it, or get someone else involved for the days I’m not here. The Golden Globes are next week, and I can’t decide whether to get up crazy early to update the page or just wait until I’m in the office. Shame they’re not on terrestrial telly!
  • Changes to From the archive: The Guardian dropped a number of pages from the paper yesterday, including shortening the Comment section, which means our on this day column is no longer in the print version. But fear not, we’re continuing online (may blog this later). I had four to upload on Monday, phew! I’ve rewritten the instructions on how to do it all to reflect the changes, too.
  • Work experience: We have a work experience bod with us for a fortnight, so I’ve been showing him some of the longer-term projects we’re working on. There aren’t many opportunities to train people in our department, so it’s good to get a bit of experience.
  • Developing the intranet: Our plans for the intranet took a back seat over Christmas, but we’re getting back into it again. I’d hit a point with the design where I needed someone else to come in and advise, and a colleague is now helping me with the images, so we should be ready to relaunch soon.
  • Journalist queries included the current electorate of Denmark (took a bit of digging but I now know the words for election, postal votes and invalid votes in Danish), anything on Nick Walkley for an interview, background on Kate Freud and interviews by Lucian on family, a specific Guardian story on Ed Miliband “by a Labour frontbencher” and polls on public attitudes to cuts.

Working week, 3-4 January 2012

Wow, where did 2011 go?

  • Corrections: Lots to catch up on from the festive period, so it took me a while to work my way through the list. Luckily it was otherwise a slow start to the week back!
  • Department meeting: planning blog coverage for Olympics and other upcoming events, what to do with From the archive (will be dropped from the paper soon), engaging more with Twitter followers.
  • Intranet: I’ve not worked on it for a while, but a colleague has offered to help with the design so hopefully we can relaunch soon.
  • Fruitless searches: I was asked for the text of an Early Day Motion on Thatcher’s funeral from 2007. A search of the parliament.uk site turned up nothing, so I checked for cuts on Factiva and it turns out the EDM wasn’t actually tabled (too controversial I assume!). If something isn’t there, there’s usually a good reason. I did discover the really simple database of EDMs on the parliament.uk site for next time, though.
  • Journalist queries included whether there was dancing in the Welsh valleys when Churchill died, recent polls on Thatcher (please, no more Thatcher queries…), the source of Rick Santorum’s “CS Lewis” quote (unknown, to all but him), an article from the BMJ (we don’t subscribe but the health desk do), and locating a review that’s not on the digital archive (I suggested the reader contact a specialist archive – another case of a review only appearing in an early edition so not archived, I think).

CPD23 Thing 23 (!): Time for reflection

Child looking out to sea on Cornish beach

Looking to the year ahead

CPD23 Thing 23 post

I can’t believe I’ve finished CPD23! Well beyond the deadline, and not quite by my self-imposed end-of-year deadline either, but I made it anyway. I’m not always great at seeing things through, so I’m proud to have got this far. What am I going to blog about now?!

I’ve written a PDP for Chartership, so I’ve drawn on that as well as CPD23 for this exercise.

What have I learned?

CPD23 has been hugely helpful. I’ve said it before, but in a shrinking department with limited budget there aren’t many opportunities for career development or discussions. The programme has introduced me to a new network of career-minded info pros, as well as tools and techniques I wouldn’t have come across otherwise.

Gaps in my knowledge

  • Digital skills (I have the basics but don’t always know the best way to harness them, or the best tools for the job).
  • Promoting the department (internally and externally – must do more).
  • Getting involved in the profession more fully (through face-to-face groups and events, and online).

Goals

My main goal for this year is to charter. I’m in the process of compiling my portfolio, so there’s a lot of work ahead but I’m hopeful!

Beyond that, I want to get involved in the profession and address some of the other gaps identified above. 2012 is going to be a year for career development!