This is my personal blog. The views expressed on these pages are mine alone and not those of my employer.

Thursday, 17 July 2014

Unassuming Unicode, the secret to characters on the web

Recently I got an e-mail with an interesting title:

How did they do that?

Just how did KLM insert an airplane into the subject of an e-mail? Unicode!

I won't put a full description here, but Unicode is the system that assigns a unique identifier (a code point) to every single character your computer is capable of displaying. Yes: Chinese, Yiddish, Maldivian, airplane symbols, the lot!

So what does this look like under the hood?

To find out I copied the character into Notepad and saved it, ensuring I selected 'Unicode' as the encoding at the bottom of the 'Save As' dialog (which in Notepad means little-endian UTF-16).

Then I viewed the raw binary of the file in a hex editor (I just happened to pick this online one).  The results were simply:

FF FE 08 27

What we're seeing here is the hexadecimal representation of the file's raw bytes.  You can confirm this using Windows Calculator in programmer mode, but for simplicity this is:

FF     11111111
FE     11111110
08     00001000
27     00100111

The first two bytes are the byte order mark (BOM), telling us this is little-endian UTF-16.  Endianness simply tells us from which end we read the data: little-endian means the least-significant byte comes first, so we swap the byte order to recover the value.

So, swapping the bytes (and omitting the byte order mark) we now have:

27 08

Which just so happens to be the code point for the airplane symbol, U+2708:
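We can sanity-check this decoding with a short Python sketch (standard library only):

```python
# The exact bytes seen in the hex editor: BOM (FF FE) followed by 08 27.
raw = bytes([0xFF, 0xFE, 0x08, 0x27])

# Python's utf-16 codec consumes the BOM and applies the byte order it declares.
decoded = raw.decode("utf-16")

# The remaining two bytes, least-significant first, give code point U+2708.
code_point = ord(decoded)
print(decoded, hex(code_point))
```

The codec does the byte swap for us, which is exactly what the BOM is for.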

But why do you care about this?  You could've just copied and pasted the original symbol, right?

Well, it just so happens that HTML character references map directly onto these Unicode code points.  So if I wanted to use this character myself, I'd want to be absolutely certain it'll render correctly.

To do this I'd first make sure my page declares its encoding using the correct meta tag:
<meta charset="utf-8">
Then I can create the character using &#xnnnn; where nnnn is the hexadecimal code point.  Therefore &#x2708; creates our airplane:
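As a quick illustration, here's a hedged Python sketch that builds such a character reference from any character (the helper name is mine, not part of any standard API):

```python
# Build the hexadecimal HTML numeric character reference for a character.
def html_entity(ch):
    return "&#x{:X};".format(ord(ch))

airplane = "\u2708"
entity = html_entity(airplane)
print(entity)
```

The same helper works for any code point, from plain ASCII upwards.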

That's just one.  There are 109,383 other characters out there, go and use 'em.

Saturday, 7 June 2014

Keeping your source safe

Too many times now I have seen a fear of committing code, with many developers waiting until they are absolutely certain their code is damn near perfect before hitting commit.  I blame the terminology: commit sounds so final, as if it carries reputation consequences.  That's why I prefer to call them checkpoints:

A checkpoint is a point in time that you can return to - no matter what happens:

- Your hard drive fails
- You find yourself needing to backtrack
- You take a holiday
- You lose a 'life'

The more checkpoints you have, the more choice you're giving yourself in the future to return to.

That's why I advocate checking your code in early and often. Don't worry if it's a work in progress, there are missing tests, or it's not perfect. Check it in!

Of course I'm not advocating checking in crap, so there have to be rules:

- It should compile
- All tests pass
- You keep it on your own branch
- You include any new code since the last commit
- A commit message is nice (although not mandatory for every commit)

These are just common courtesies to your fellow developers, meaning they'll be able to pick up from where you left off for whatever reason.

Using source control like this keeps your code safe, provides an audit trail, and allows others to see your work.

Therefore I urge you to commit often; after all, it's your branch.

Thursday, 27 March 2014

Reliance on implementation details

Recently I stumbled across an issue in a legacy app which didn't appear to make any sense.  The issue involved determining the precision of a Decimal, which gave different results for exactly the same value.

First of all I wrote a quick test to attempt to replicate the problem, which appeared to happen for 0.01:

    private int expectedDecimalPlaces = 2;

    public void Test2DecimalPoint_WithDecimal_ExpectSuccess()
    {
        decimal i = 0.01m;
        int actual = Program.Precision(i);
        Assert.AreEqual(expectedDecimalPlaces, actual);
    }

This passed. Then I noticed that a particular method call's signature expected a Decimal but was instead being supplied a float (yes, Option Strict was off [1]), meaning the float was being implicitly converted. I quickly wrote a test incorporating the conversion:

    private int expectedDecimalPlaces = 2;

    public void Test2DecimalPoint_CastFromFloat_ExpectSuccess()
    {
        float i = 0.01f;
        int actual = Program.Precision((decimal)i);
        Assert.AreEqual(expectedDecimalPlaces, actual);
    }

This test fails: it seems to think 0.01 is to 3 decimal places!

So what's going on here? How can a conversion affect the result of Precision()? Looking at the implementation I could see it was relying on the individual bits the Decimal is made up from, using Decimal.GetBits() to access them:

    public static int Precision(Decimal number)
    {
        // The fourth element of GetBits() holds the flags, including the scale.
        int bits = Decimal.GetBits(number)[3];

        // The third byte of the flags is the scale (exponent) byte.
        byte scale = BitConverter.GetBytes(bits)[2];

        return scale;
    }

The result of Decimal.GetBits() is a four-element array, of which the first three elements represent the bits that make up the value of the Decimal.  However, this method relies only on the fourth element - the flags - which contains the exponent. In the passing test the stored value was 1 with flags of 131072; the failing test had a value of 10 with flags of 196608.
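In other words, BitConverter.GetBytes(bits)[2] is pulling out the scale byte of the flags element. The same extraction can be sketched in Python with a bit shift, using the two flag values from the tests above:

```python
def scale_from_flags(flags):
    # The scale (exponent) lives in bits 16-23 of the flags element,
    # i.e. the third byte when read least-significant first.
    return (flags >> 16) & 0xFF

passing_flags = 131072   # 0x00020000, from the direct decimal literal
failing_flags = 196608   # 0x00030000, from the float-converted decimal

print(scale_from_flags(passing_flags), scale_from_flags(failing_flags))
```

Shifting by 16 and masking a byte is exactly the "third byte" the C# code reads.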

When we convert the flags to binary the difference becomes clearer. I've named them bitsSingle for the failing test and bitsDecimal for the passing test:

bitsSingle     00000000 00000011 00000000 00000000
               |\-----/ \------/ \---------------/
               |   |       |             |
        sign <-+ unused exponent       unused
               |   |       |             |
               |/-----\ /------\ /---------------\
bitsDecimal    00000000 00000010 00000000 00000000

NOTE: exponent represents multiplication by negative power of 10

As you can see the exponent for bitsSingle is 3 (00000011) whereas the exponent for bitsDecimal is 2 (00000010), which represent negative powers of 10.

Looking back at the original numbers we can see how these both accurately represent 0.01:

bitsSingle has a value of 10, with an exponent of -3: 10 × 10^-3 = 0.01
bitsDecimal has a value of 1, with an exponent of -2: 1 × 10^-2 = 0.01

As you can see, Decimal can represent the same value even though the underlying data differs. Precision() relies only on the exponent and ignores the value, so it isn't taking the full picture into account.

But why does the conversion store this number differently than direct instantiation?  It just so happens that creating a new Decimal (via the Decimal constructor) uses slightly different logic than the cast. So even though the number is correct, the underlying data is slightly different.
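Python's decimal module happens to exhibit the same behaviour, which makes for a convenient illustration: two Decimals can compare equal while their underlying representations differ.

```python
from decimal import Decimal

# Two representations of the same value: 1 x 10^-2 and 10 x 10^-3.
a = Decimal("0.01")
b = Decimal("0.010")

print(a == b)                      # equal as values...
print(a.as_tuple(), b.as_tuple())  # ...but different underlying data
```

as_tuple() exposes the sign, digits and exponent, so any code keying off the exponent alone would make exactly the mistake Precision() does.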

This brings us to the point of the article: never rely on implementation details, only on what is exposed through defined interfaces - whether that's a web service, reflection on a class, or peeking into the individual bits of a datatype.  Implementation details can not only change, but in the world of software are expected to.

If you want to play around with the examples above I've uploaded them to GitHub.

[1] I know it's not okay and there isn't a single good reason for this; however, as usual with a legacy app, we simply don't have the time/money to explicitly convert every single type in a 20,000+ LOC project.

Wednesday, 18 December 2013

Highlights of the year (literally)

As the end of the year approaches, I thought it'd be prudent to make a list of all the nuggets of advice and insight I've read this year:

Effective Programming: More Than Writing Code (Jeff Atwood)

It’s amazing how much you find you don’t know when you try to explain something in detail to someone else. It can start a whole new process of discovery.
There's no question that, for whatever time budget you have, you will end up with better software by releasing as early as practically possible, and then spending the rest of your time iterating rapidly based on real-world feedback. So trust me on this one: even if version 1 sucks, ship it anyway. 

Lehman's laws of software evolution

As an evolving program is continually changed, its complexity, reflecting deteriorating structure, increases unless work is done to maintain or reduce it.

Scrum: A Breathtakingly Brief and Agile Introduction (Chris Sims, Hillary Louise Johnson)

The daily scrum should always be held to no more than 15 minutes.

(Matt Asay)

Oracle has never been particularly community-friendly. Even the users that feed it billions in sales every quarter don't particularly love it.

The Art of Unit Testing: with Examples in .NET (Roy Osherove)
Finally, as a friend once said, a good bottle of vodka never hurts when dealing with legacy code.

Thursday, 17 October 2013

Recently I had the need to decode a Base64 string and write it out as a PDF.  Usually I would've written a small utility app, but this time I rolled with PowerShell:

    function decodeBase64IntoPdf([string]$base64EncodedString)
    {
        $bytes = [System.Convert]::FromBase64String($base64EncodedString)
        [IO.File]::WriteAllBytes("C:\Users\medmondson\Desktop\file.pdf", $bytes)
    }

I'm impressed with how quickly I can knock out a script like this (yes, those are .NET assemblies) without having to load a new VS solution. Of course a lot more could be done with it (the file format via an argument, for example), but I thought I'd share it raw as I know I'll need it again one day.
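For what it's worth, the same one-liner translates to other scripting languages too; here's a rough Python equivalent (the file name and payload below are just placeholders):

```python
import base64
import os
import tempfile

def decode_base64_into_file(encoded, path):
    # Convert the Base64 text back into raw bytes and write them out.
    data = base64.b64decode(encoded)
    with open(path, "wb") as f:
        f.write(data)
    return data

# Round-trip check with a known payload; a real PDF would come from your source system.
payload = base64.b64encode(b"%PDF-1.4 example").decode("ascii")
out_path = os.path.join(tempfile.gettempdir(), "file.pdf")
written = decode_base64_into_file(payload, out_path)
```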

Monday, 26 August 2013

The myth of software development

When you're developing software, have you ever thought "once this feature is complete I'll be done"? I'm the first to admit that there is always an end point in sight, believing once I've reached it I'll be able to say I'm finished.

Well, guess what: software can never be considered finished. Don't believe me?  Then why is Windows XP still being updated almost 12 years after its initial release?

Psychologically a lot of people compare a software project with more traditional types of projects, such as construction, however the two are simply not comparable:

  • Software only ever reaches a state of acceptable functionality
  • Software is infinitely malleable meaning it can never reach a state of 'done'

Both of these reasons, as well as showing why software isn't comparable to construction, show that starting a software project again is very rarely the right choice - instead, adapt the software towards the new state of acceptable functionality.

This is because software is the cumulative sum of all previous work; even reasonably small products are the culmination of many man-years.  In addition, users understand how it works and all the quirks of its features, including how to use them to the organisation's advantage.

Therefore no matter how spaghetti-ridden, ill-named and awkward that legacy project is, it is almost never the right decision to start again from scratch.

Which is exactly why code needs to be maintainable, because you almost certainly won't be the only person who has to look after it.  Tools such as ReSharper can help with this, and are great for transforming a spaghetti-ridden legacy project into something you can work with (you may even manage to get some unit test coverage!).

Therefore next time you want to start again from scratch think very carefully, as it's almost never the right choice.

Tuesday, 26 February 2013

Overview of type suffixes

I'd like to bring your attention to an area of the C# specification which is misunderstood by many:

Type Suffixes

Type suffixes are individual characters that you can append to any literal representation of a value in your code, allowing you to specify its exact type.  They only apply to numbers, which can be defined in one of two forms:

- Integer literals (Whole numbers)
- Real literals (More precision)

If you type 10 into your source code, the compiler will automatically interpret it as an integer type; however if you were to type 10.1 it would be automatically interpreted as a real type (because of the decimal point - the full rules are in the C# specification).  To demonstrate this I'll use the var keyword, which infers a type from the variable's initial assignment: typing 10 gives me an int (the integer literal default), whereas typing 10.1 gives me a double (the real literal default).

Type suffixes allow you to override these defaults.

For example, what if I wanted to specify a float?  It turns out there's a type suffix for this: f, as in 10.1f (Single is a synonym for float),

and similarly decimal has m, as in 10.1m.

The point here is that the literal's type is defined the moment you enter it into the source code, not by the variable you are assigning it to. This becomes important when you want to assign to a type where there isn't an implicit conversion available from the literal's default type:

Here you're essentially attempting to store a number inside a box that's too small (usually referred to as a narrowing conversion): decimal d = 10.1; won't compile, because there is no implicit conversion from double to decimal.  To get around this you need to tell the compiler you actually wanted a decimal: decimal d = 10.1m;
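C#'s suffixes have no direct Python equivalent, but the underlying principle - that a literal's type is fixed the moment it's written, not by the assignment target - holds there too. A rough sketch (illustrative only):

```python
from decimal import Decimal

# The parser fixes each literal's type before any assignment happens.
a = 10      # integer literal -> int
b = 10.1    # real literal -> float (the default, like double in C#)

# Going through the float default first bakes in binary rounding error,
# which is why you state your intended type at the literal itself.
via_float = Decimal(10.1)      # float literal converted afterwards
direct = Decimal("10.1")       # intended value stated up front

print(type(a).__name__, type(b).__name__)
print(via_float == direct)
```

The two Decimals differ for the same reason the float-to-decimal cast surprised us in the earlier precision post: the default literal type leaves its mark before the conversion happens.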

I understand this topic is somewhat basic, but I believe it deserved an overview nonetheless.