Ubuntu ttf fonts, all at once.

Ever wanted to install every ttf font in the Ubuntu repositories at once?   I have the command for you.

I was using Inkscape and was annoyed with the meagre list of fonts available, so I assembled this with a quick regex and a few manual corrections.  It’s about 300 packages, takes up nearly 900MB of space, and is a 335MB download. (Good compression on fonts, apparently!)

(Edit: you may wish to skip ttf-mathematica4.1 and  ttf-mscorefonts-installer, as they require you to agree to licences.  I’ve put them at the end for easier deletion.)

sudo apt-get install fonts-arphic-uming fonts-hanazono fonts-lklug-sinhala fonts-mona fonts-opensymbol fonts-unfonts-core fonts-unfonts-extra ttf-adf-accanthis ttf-adf-baskervald ttf-adf-berenis ttf-adf-gillius ttf-adf-ikarius ttf-adf-irianis ttf-adf-libris ttf-adf-mekanus ttf-adf-oldania ttf-adf-romande ttf-adf-switzera ttf-adf-tribun ttf-adf-universalis ttf-adf-verana ttf-aenigma ttf-alee ttf-ancient-fonts ttf-anonymous-pro ttf-aoyagi-kouzan-t ttf-aoyagi-soseki ttf-arabeyes ttf-arphic-bkai00mp ttf-arphic-bsmi00lp ttf-arphic-gbsn00lp ttf-arphic-gkai00mp ttf-arphic-ukai ttf-arphic-ukai-mbe ttf-arphic-uming ttf-atarismall ttf-baekmuk ttf-bengali-fonts ttf-beteckna ttf-bitstream-vera ttf-bpg-georgian-fonts ttf-breip ttf-century-catalogue ttf-comfortaa ttf-dejavu ttf-dejavu-core ttf-dejavu-extra ttf-dejima-mincho ttf-denemo ttf-devanagari-fonts ttf-droid ttf-dustin ttf-dzongkha ttf-ecolier-court ttf-ecolier-lignes-court ttf-engadget ttf-essays1743 ttf-evertype-conakry ttf-f500 ttf-fanwood ttf-farsiweb ttf-femkeklaver ttf-fifthhorseman-dkg-handwriting ttf-freefarsi ttf-freefont ttf-georgewilliams ttf-gfs-artemisia ttf-gfs-baskerville ttf-gfs-bodoni-classic ttf-gfs-complutum ttf-gfs-didot ttf-gfs-didot-classic ttf-gfs-gazis ttf-gfs-neohellenic ttf-gfs-olga ttf-gfs-porson ttf-gfs-solomos ttf-gfs-theokritos ttf-goudybookletter ttf-gujarati-fonts ttf-hanazono ttf-inconsolata ttf-indic-fonts ttf-indic-fonts-core ttf-ipafont-jisx0208 ttf-ipafont-uigothic ttf-isabella ttf-jsmath ttf-junicode ttf-jura ttf-kacst ttf-kacst-one ttf-kanjistrokeorders ttf-kannada-fonts ttf-khmeros ttf-khmeros-core ttf-kiloji ttf-komatuna ttf-konatu ttf-kouzan-mouhitsu ttf-lao ttf-larabie-deco ttf-larabie-straight ttf-larabie-uncommon ttf-levien-museum ttf-levien-typoscript ttf-lg-aboriginal ttf-liberation ttf-lindenhill ttf-linex ttf-linux-libertine ttf-lyx ttf-malayalam-fonts ttf-manchufont ttf-marvosym ttf-mgopen ttf-mikachan ttf-misaki ttf-mona ttf-monapo ttf-motoya-l-cedar ttf-motoya-l-maruberi 
ttf-mph-2b-damase ttf-mplus ttf-nafees ttf-nanum ttf-nanum-coding ttf-nanum-extra ttf-ocr-a ttf-oflb-asana-math ttf-oflb-euterpe ttf-okolaks ttf-oldstandard ttf-opendin ttf-oriya-fonts ttf-paktype ttf-prociono ttf-punjabi-fonts ttf-radisnoir ttf-root-installer ttf-rufscript ttf-sawarabi-gothic ttf-sawarabi-mincho ttf-sazanami-gothic ttf-sazanami-mincho ttf-sil-abyssinica ttf-sil-andika ttf-sil-charis ttf-sil-dai-banna ttf-sil-doulos ttf-sil-ezra ttf-sil-galatia ttf-sil-gentium ttf-sil-gentium-basic ttf-sil-nuosusil ttf-sil-padauk ttf-sil-scheherazade ttf-sil-sophia-nubian ttf-sil-yi ttf-sil-zaghawa-beria ttf-sinhala-lkmug ttf-sjfonts ttf-staypuft ttf-summersby ttf-tagbanwa ttf-takao ttf-takao-gothic ttf-takao-mincho ttf-takao-pgothic ttf-tamil-fonts ttf-telugu-fonts ttf-thai-arundina ttf-thai-tlwg ttf-tiresias ttf-tmuni ttf-tomsontalks ttf-tuffy ttf-ubuntu-font-family ttf-ubuntu-title ttf-umefont ttf-umeplus ttf-unfonts-core ttf-unfonts-extra ttf-unifont ttf-unikurdweb ttf-uralic ttf-vlgothic ttf-wqy-microhei ttf-wqy-zenhei ttf-xfree86-nonfree ttf-xfree86-nonfree-syriac ttf-yanone-kaffeesatz ttf-yofrankie ttf-mathematica4.1 ttf-mscorefonts-installer
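
For anyone wanting to rebuild the list on a newer release, here is a minimal sketch of the "quick regex" step in python. The pattern and the sample names are assumptions for illustration only (the original regex isn't recorded here); the two licence-gated packages are the ones named in the edit above, and they are pushed to the end in the same way.

```python
import re

# Assumed pattern: the post only says "a quick regex", so this is a guess.
FONT_NAME = re.compile(r"^(ttf|fonts)-")

# The two packages the post says require agreeing to a licence.
NEEDS_LICENCE = {"ttf-mathematica4.1", "ttf-mscorefonts-installer"}


def font_packages(names):
    """Return sorted font package names, with licence-gated ones last."""
    fonts = sorted({n for n in names if FONT_NAME.match(n)})
    free = [n for n in fonts if n not in NEEDS_LICENCE]
    gated = [n for n in fonts if n in NEEDS_LICENCE]
    return free + gated


# Hypothetical input; in practice the names would come from the package index.
sample = ["ttf-dejavu", "inkscape", "ttf-mscorefonts-installer",
          "fonts-mona", "ttf-alee"]
command = "sudo apt-get install " + " ".join(font_packages(sample))
```

The real list of names could be fed in from something like the output of apt-cache pkgnames, one name per line, and the result would still need the same manual corrections mentioned above.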

Some slightly noisy peace and quiet.

I’m currently working in an open concept space – surrounded by whiteboards, but not, ironically, people who are working collaboratively*.  Thus, the space has become a “quiet only” zone.

Given the recent studies that show ambient noise boosts productivity and creativity (e.g. http://www.jstor.org/stable/info/10.1086/665048, which you should take with a grain of salt – I haven’t read anything other than the title), I thought I’d share something that my colleague introduced me to… http://coffitivity.com/.

It’s been featured on Lifehacker, and plenty of other places.  Given that I don’t drink coffee, you wouldn’t think it would be particularly useful, but surprisingly, I’m enjoying having ambient noise.  It drowns out other people’s loud conversations, and is somewhat relaxing at the same time.  It also sounds pretty decent over a light veneer of baroque music.

Anyhow, one more tool to add to my toolkit.  Thanks Jake!

*Please see comment below.

Faster is better…

I was going to write about the crazy street preacher I met on the bus this morning, but I have some far more topical stuff to mention.

I’ve been working on some code in python, which, much to my dismay, was taking a LONG time to run.  Around 11 hours of processing time, using 100% of all 8 CPUs (summed up user time of 6110 minutes – about 100 hours of total CPU time), for something I figured should take about an hour.

I’d already used profiling to identify that the biggest bottlenecks were in two places – putting and retrieving information from a single queue shared by all 8 worker processes, and a single function that is central to the calculations being done by the workers.  Not being able to figure out what was happening in the workers’ code, I spent some time optimizing the queue handling, with some fancy queue management (which later turned out to be outperformed by simply picking a queue at random to push the data into) and a separate queue for each process (which did, in fact, cut wasted time significantly).  Before that, it had been hovering around 20 hours, with 8 processes. So, I’d already been doing well, but it was still well above where I thought it should be.
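
As a rough illustration of the "pick a queue at random" idea, here is a minimal sketch. It uses the standard library's queue.Queue as a stand-in so it runs in a single process; the real code would presumably use one multiprocessing queue per worker, and the name dispatch is made up for this example.

```python
import queue
import random


def dispatch(items, queues, rng=random):
    """Push each item onto a randomly chosen per-worker queue."""
    for item in items:
        rng.choice(queues).put(item)


# One queue per worker (8 workers, as in the post); queue.Queue is a stand-in.
queues = [queue.Queue() for _ in range(8)]
dispatch(range(1000), queues)
sizes = [q.qsize() for q in queues]
```

With enough items the queues end up roughly evenly loaded, without any bookkeeping about which worker is busiest.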

So, following up on John’s comment the other day, I gave Pypy a shot.  Unfortunately, the first thing I discovered was that pypy doesn’t work with numpy, the numerical computing library for python.  No problem – I was only using it in one place.  It only took a few seconds to rewrite that code so that it used a regular 2D array.

Much to my surprise, I started getting an error elsewhere in the code, indicating that a float was being used as an index to a matrix!

Indeed, it only took a few seconds to discover that the code was calling the absolute value  of an int, and the returned value was a float – not an integer…

Which means that numpy was hiding that mistake all along, without warning or error!  Simply putting a cast on the float (e.g. int(math.fabs(x))) was sufficient to drive the total running time of the code down to 111m of user time on the next data set of comparable size (about 2 hours, with a real time of 48 minutes because the fancy queue manager mentioned above wasn’t working well).
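
The float-versus-int behaviour is easy to see in isolation. This is a small sketch with made-up values, not the original code: math.fabs always returns a float, the built-in abs keeps an int as an int, and a plain python list refuses a float index outright.

```python
import math

x = -3
a = math.fabs(x)        # math.fabs always returns a float
b = abs(x)              # the built-in abs keeps an int as an int
c = int(math.fabs(x))   # the cast described in the post

row = [10, 20, 30, 40]  # made-up data standing in for one matrix row
try:
    row[a]
    float_index_ok = True
except TypeError:
    # Plain lists reject float indexes, which is how the bug surfaced.
    float_index_ok = False
```

Using abs(x) directly would avoid the cast altogether when the input is already an int.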

Yes, I’m comparing apples with oranges by not rerunning on the same data set, but I am trying to process a large volume of data, and I wasn’t expecting such a massive performance boost.  I’ll get around to proper benchmarking when that’s done.

Unfortunately, in the end, I never could get pypy running.  It turns out that it’s incompatible with pysam, a library I’m using to process BAM (next-generation sequencing) files.  I don’t have any alternatives for that library and I can’t drop it, so pypy is out.  However, it did help me identify that numpy is far too accepting of bad data for array indexes, and while it is able to handle it correctly, numpy does so with a huge time penalty on your code.

So, lessons learned: profiling is good, pypy isn’t ready for heavy computing use, and numpy should be treated with caution!  And, of course, yes, you can write fast code with python – you just have to know what you’re doing!

Check your assumptions

I just went through one of those random Linux trial-by-fire exercises.  I had two web servers, one cloned from the other, behaving differently: One would send emails, the other wouldn’t.

I walked through the tree of all possible things that could be the problem: user-installed software, system software, system configurations, and right down to logs and individual files…  and then realized the most obvious source of the difference: the processes that handle mail on one server had died, but not on the other.

Yes, 2 hours of debugging mail handling on a Linux machine, only to discover that I could have figured this out with “ps aux | grep mail” in about 10 seconds, had I known what to look for.
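
The same check can be scripted, which is handy when comparing two cloned servers. This is a minimal sketch assuming a Unix-like system with ps on the path; the function names are made up for this example.

```python
import subprocess


def list_processes():
    """Capture the output of `ps aux` (assumes a Unix-like system with ps)."""
    result = subprocess.run(
        ["ps", "aux"], capture_output=True, text=True, check=True
    )
    return result.stdout


def matching_processes(ps_output, needle):
    """Return the lines of ps output that mention needle, like grep would."""
    return [line for line in ps_output.splitlines() if needle in line]
```

Calling matching_processes(list_processes(), "mail") on each server and comparing the two results would have shown the missing mail processes straight away.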

Well, that’s the nature of troubleshooting – the answer is always obvious after you find it.

On the bright side, it means my lab has a shiny new blog to play with – and it seems like everything is working now.

 

Python “threading”

Just a quick rant about threading.  After working with Java for so long, I’d gotten used to the idea that a thread is an independent entity, which can work for you without slowing down the main body of your program.  In python, that’s really not the case.

In python, a thread shares memory with the main “thread” of your code, and the standard interpreter (CPython) guards that shared state with a global interpreter lock (the GIL), which prevents threads from running python code on several CPUs at once.  In effect, with python threads, only one thread can execute code at any given moment – meaning that all your threads take turns passing lines of code to the CPU to be run. (Or, as a mental image, that’s how it’s working; the reality is a bit more subtle.)

Unfortunately, that means that python threads don’t really speed up CPU-bound code, and if there are enough of them, they can slow it down significantly.

The solution turned out to be a python module called “multiprocessing”, which allows you to spawn processes instead of threads.  Each individual process can run on a different CPU (if you have enough cores…), but does not share any memory with the main process or thread of your code.  Thus, you have to work out a system of thread-safe (process-safe?) queues, where each process can dump information into a buffer, allowing other threads or processes to pick up information and process it independently.  The worker processes can consume the information in parallel, giving you a speed up of the running time (wall time) of your code.
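
To make the queue arrangement concrete, here is a minimal sketch of the pattern, not the actual code from my project. The work function is a made-up stand-in for a CPU-heavy calculation, and the names worker and run are only for illustration; the Queue and Process classes are the ones from the multiprocessing module.

```python
import multiprocessing as mp


def work(n):
    """Stand-in for a CPU-heavy calculation (hypothetical)."""
    return sum(i * i for i in range(n))


def worker(tasks, results):
    """Pull items off the task queue until the None sentinel arrives."""
    for n in iter(tasks.get, None):
        results.put((n, work(n)))


def run(inputs, n_procs=4):
    """Fan a list of inputs out to n_procs worker processes."""
    tasks, results = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(tasks, results))
             for _ in range(n_procs)]
    for p in procs:
        p.start()
    for n in inputs:
        tasks.put(n)
    for _ in procs:
        tasks.put(None)  # one sentinel per worker so each one exits
    # Drain the results before joining, so no worker blocks on a full pipe.
    out = dict(results.get() for _ in inputs)
    for p in procs:
        p.join()
    return out
```

Something like run([10, 100, 1000]) should be called from under an if __name__ == "__main__": guard, since on some platforms each new process re-imports the main module. Results arrive in whatever order the workers finish, which is why each one is tagged with its input.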

All in all, it’s actually a relatively elegant system, not much different than many other languages, with the exception of the terminology – processes vs threads.  Python got it right, but it took me a while to figure out that I was using the terms incorrectly.

At any rate, without any optimization, multiprocessing with 30 processes brings the wall time of my code down from about 3 hours to about 15 minutes.  It’s almost time to start looking into C code optimization…  Almost. (-;

Link to a TED talk – reversing desertification

I enjoy listening to TED talks while I work – although I usually end up tuning out for half or most of each one while I focus on coding – but this one kept my attention throughout.  I think I got 10 lines of code written in 20 minutes, albeit 10 good lines…

Anyhow, reversing desertification is always an interesting topic, and it’s a great counterpoint to the usual doom and gloom of climate change articles you see in the press.  Either way, if you can find 22 minutes, I highly suggest watching this talk.

Allan Savory – How to green the world and reverse climate change

 

Illumina 450k methylation array re-annotation

The Kobor lab, where I’m currently working, does a lot of epigenetics and has been working with the Illumina 450K Human methylation array.  In some ways, it’s a stepping stone towards a next-gen platform, and in other ways, it’s still a perfectly valid platform all on its own.  I’m not really an array person (yet?), but I can see the advantages when it comes to bisulfite sequencing – mapping with bisulfite-treated DNA is a bit of a dark art. (I know it can be done, but it’s not ideal.)

Anyhow, they’ve undertaken an interesting project, with a few collaborators, to re-annotate the 450k methylation array, identifying probes that give erroneous results due to the presence of SNPs or cross-hybridization.  They show it in action with a couple of data sets as well, so it’s not just a theory paper.  It’s a quick but interesting must-read for those interested in using the 450k human methylation array.

You can find it at http://www.epigeneticsandchromatin.com/content/6/1/4/abstract

Wacom!

Ok, you’ll never guess what I came across today!  Or, well, if you’ve read the topic, you might…  I found an (apparently) owner-less Wacom tablet.  That probably doesn’t mean much to most people, but in my experience, you don’t just find Wacom tablets lying around gathering dust… unless they’re broken of course.

So, while I was cleaning up the desk next to mine for a rotation student, I discovered it just sitting there.  I’d never seen one without a fancy engineering template taped on top, so I assumed it was either broken or missing a piece or relegated to its dust gathering status for some equally horrific disfigurement.  I cheerfully set it aside, thinking I’d just plug it in later to see why it was discarded.

Well, much to my surprise, the thing works!  If you’ve never seen a Wacom tablet before, it’s the king of mice, the grandfather of touch pads, the cat’s pyjamas.  Holy cow are they cool.  I plugged it in and started doodling with it immediately in Inkscape.  Seriously, that is a nice piece of hardware.

I have the feeling that, if no one claims it, I’ll be doing some serious doodling for all of my projects – and with my plans for doing visualization work, this could be the start of some really fun images.  While I’m still feeling pumped about it, I’ll challenge myself to post something drawn with it for next week…  in 7 days I should be able to do something neat, otherwise I am clearly unworthy of such a lucky find!

Is Encode bunk?

Ok, I’m sick, so this is a very short post.  I just stumbled onto this article in the Guardian.  Not being a Brit, I have no idea if it’s even a remotely reputable journal, or why this piece is so sensationalist.  So… scientists see evidence and are working to understand whether much of the genome serves a purpose, and they disagree on the interpretation.  Neither side has conclusive evidence, but the Encode project certainly has evidence that makes its claims seem valid.

In contrast, a bunch of biologists seem to have jumped on it and insist that most of the DNA in the human genome is still “junk” and does nothing.

While I don’t support the side that seems to be calling “BS” on the Encode project, the character of the article seems unnecessarily vitriolic.  Does the UK have republicans?

Edit: Finally feeling well enough to get back on my computer and look for the source of this argument: Here.  And after reading a few pages…Wow! I can’t believe that was published as is.  The abstract alone sounds like someone got up on the wrong side of the bed, and then ate nettles for breakfast.

A few things on python

A few quick notes.  Python, so far, is really a neat language.  It’s fast to write in, it’s easy to do unit tests, and using pydev made the transition from Java to python a lot easier.  I’m not a professional python developer by any stretch, but I can bang out python code pretty quickly now, and I’m pretty happy.

Duck-typing does annoy the heck out of me still, because I know things will crash at run time instead of easily caught errors being flagged while writing the code or while compiling, but I’m sure I’ll get over that, and unit tests do compensate for it.

I’ve also picked up Egit for version control, and that has been a bit confusing – not because it’s git, but because the git model of software development doesn’t provide an external backup for your repository unless you push it to the server.  Somehow, I hadn’t actually realized that until I started playing with it.  It simply means pushing to a branch at the end of the day for a backup, or at stable points, which isn’t a bad idea, really.  Once I get more comfortable with the system, I’m sure it’ll work far better than SVN ever did for me.

Beyond that, I’m also pretty impressed with the libraries available for python.  I’ve played a little with Pysam, and have been reasonably impressed with the results.  I haven’t done any benchmarks on it yet, but I’m ok with it so far.  It did take me a while to realize that an aligned read’s “.aend” property is the 3′ end and the “.pos” is the 5′ end on the positive strand… and I’m not convinced that I haven’t somehow introduced an off-by-one error in the code, but these things will be sorted out in time.
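
On the off-by-one worry, here is a small sketch of the coordinate arithmetic in plain python, with no pysam involved. It assumes, as I understand the pysam documentation, that pos is the 0-based leftmost aligned base and aend is one past the last aligned base; that assumption is worth checking against the version in use. The helper read_ends is made up for this example.

```python
def read_ends(pos, aend, is_reverse):
    """Return (five_prime, three_prime) as 0-based inclusive coordinates.

    Assumes pos is the 0-based leftmost aligned base and aend is one past
    the last aligned base, i.e. the pair forms a half-open interval.
    """
    left, right = pos, aend - 1
    # On the reverse strand the 5' end is the rightmost aligned base.
    return (right, left) if is_reverse else (left, right)
```

Under that assumption a read with pos 100 and aend 150 covers 50 bases, and its last aligned base sits at 149 rather than 150, which is exactly the sort of place an off-by-one error creeps in.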

Otherwise, I think I can say I’m reasonably happy with the choice of python.  I’m looking forward to playing with a few other libraries, like matplotlib, and scientific python, as well as Tkinter, and – although I know there will be a learning curve on it – I think there are more than sufficient python tutorials and forums to help make it reasonably easy to get through.

Ok, time for some more coding. (-: