Celerra Open-File Cache Bug

Dec 9th, 2010 | Filed under Hardware, UNIX System Administration

It seems the NFS problem we were having is due to a bug in Celerra NAS codes 5.6.36 to 5.6.43 (fixed in 5.6.44). We upgraded to NAS code 5.6.40 before our EMC support ended. I found something in the EMC Knowledgebase about customers having performance issues under heavy CIFS load due to the CIFS trickle sync feature that was added, which led to an issue with insufficient open-file cache resources because that feature allows CIFS to use the open-file cache. The open-file cache is used for NFS too, but instead of causing a performance issue when we run out of open-file cache resources, the NFS service simply starts returning NFS3ERR_IO! The suggested workaround, aside from upgrading, is to set the following parameter:

param cifs ofCache=0

We put that in our /nas/site/slot_param file to make it global to both Data Movers instead of simply putting it in the /nas/server/slot_N/param file (where N is your slot number of course). The suggested workaround also included setting this on the command line like so:

.server_config server_2 -v "param cifs ofCache=0"

This parameter requires a reboot however, as indicated by server_param:

[nasadmin@nas-dl-cs ~]$ server_param server_2 -facility cifs -info ofCache
server_2 :
name                    = ofCache
facility_name           = cifs
default_value           = 1
current_value           = 0
configured_value        = 0
user_action             = reboot DataMover
change_effective        = reboot DataMover
range                   = (0,4294967295)
description             = NA

Setting it on the command line is useless because a reboot is required anyway, so I am not sure why that was even suggested. They also indicated this in the fix as well. Note that this ofCache param won’t “appear” until it is actually set, and it’s not in the documentation. Magical, right? I saw a reference online about this being a setting for NFS too, but I think that was for NAS code 6.x. There wasn’t enough detail.
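If you want to check this kind of output from a script, here is a minimal sketch in Ruby. It only parses the "name = value" text that server_param printed above; the helper names (parse_server_param, overridden?) are my own and not part of any EMC tool, and I am assuming the output always has this shape.

```ruby
# Parse the "name = value" lines printed by `server_param ... -info`
# into a Hash of strings. Lines without "=" (such as "server_2 :") are skipped.
def parse_server_param(text)
  text.each_line.each_with_object({}) do |line, fields|
    key, value = line.split("=", 2)
    next if value.nil?
    fields[key.strip] = value.strip
  end
end

# Hypothetical helper: true when the configured value differs from the default,
# meaning someone has overridden the parameter.
def overridden?(fields)
  fields["configured_value"] != fields["default_value"]
end

# Sample text copied from the server_param output shown in this post.
SAMPLE = <<~OUTPUT
  server_2 :
  name                    = ofCache
  facility_name           = cifs
  default_value           = 1
  current_value           = 0
  configured_value        = 0
  user_action             = reboot DataMover
  change_effective        = reboot DataMover
  range                   = (0,4294967295)
  description             = NA
OUTPUT

fields = parse_server_param(SAMPLE)
puts "#{fields['facility_name']}.#{fields['name']}: " \
     "current=#{fields['current_value']} default=#{fields['default_value']}"
puts "change takes effect after: #{fields['change_effective']}" if overridden?(fields)
```

In practice you would feed it the captured output of the real command instead of the SAMPLE constant.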

We rebooted our Data Movers to set this value from the site slot_param file. Hopefully this will solve the problem, and it does sound like this bug is the cause of our problem. This information was hard to find. Sometimes I am amazed at the process that leads to finding stuff like this. I mean, have you ever tried searching in EMC Powerlink? :-)

EMC NFS Round 2

Dec 2nd, 2010 | Filed under Hardware, UNIX System Administration

We had the same problem today as yesterday, though at a much smaller scale. The following error messages seem to happen when NFS goes out to lunch:

Error	12/2/10 13:32	CFS	No free entries in open file cache
Error	12/2/10 13:32	CFS	last message repeated 101 times

I knew the word “cache” was in there somewhere. This combined with the recoverable single bit errors seems to be the problem so far. We failed over to our standby Data Mover and power-cycled the faulted primary. For now we’re just running on the old standby. It has not shown any of these errors yet. I imagine we’ll have to try failing back to see if power-cycling fixed it. I have a feeling the primary Data Mover’s hardware is getting flakey.
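To keep an eye on how often this happens, something like the following Ruby sketch could tally the messages. It assumes the log lines are tab-separated (severity, time, facility, message) the way they pasted above, and that a "last message repeated N times" line refers to the message just before it. The function name is made up for this example.

```ruby
# Count how often each message appears in tab-separated log lines,
# folding "last message repeated N times" into the preceding message.
def tally_messages(lines)
  counts = Hash.new(0)
  previous = nil
  lines.each do |line|
    _severity, _time, _facility, message = line.chomp.split("\t", 4)
    next if message.nil? # not a four-field log line
    if (m = message.match(/\Alast message repeated (\d+) times?\z/))
      counts[previous] += m[1].to_i if previous
    else
      counts[message] += 1
      previous = message
    end
  end
  counts
end

# The two lines from the Data Mover log quoted in this post.
LOG = [
  "Error\t12/2/10 13:32\tCFS\tNo free entries in open file cache",
  "Error\t12/2/10 13:32\tCFS\tlast message repeated 101 times",
].freeze

tally_messages(LOG).each { |message, count| puts "#{count}\t#{message}" }
```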

Fun times.

NFS Fun

Dec 1st, 2010 | Filed under Hardware, UNIX System Administration

I really wanted to title this “Frame 527 hates Frame 526” or something like that. These two Ethernet frames were perfect examples of an NFS problem we’ve seen in the last few weeks at work. We were having intermittent I/O errors, which usually resulted in a Bus Error in the process that was lucky enough to experience the problem. This was really hard to track down because it was so infrequent. It did seem the frequency was picking up lately, but today it just exploded. I was easily able to reproduce the problem and capture some packets on one of our Linux login servers. The following two Ethernet frames show exactly what happened. First we have frame 526 asking for an UNSTABLE write:

No.     Time        Source                Destination           Protocol Info
    526 4.036376    164.107.112.73        164.107.112.55        NFS      V3 WRITE Call (Reply In 527), FH:0x062d9726 Offset:0 Len:3488 UNSTABLE

Frame 526 (834 bytes on wire, 834 bytes captured)
    Arrival Time: Dec  1, 2010 13:17:02.415785000
    [Time delta from previous captured frame: 0.000016000 seconds]
    [Time delta from previous displayed frame: 0.000016000 seconds]
    [Time since reference or first frame: 4.036376000 seconds]
    Frame Number: 526
    Frame Length: 834 bytes
    Capture Length: 834 bytes
    [Frame is marked: False]
    [Protocols in frame: eth:ip:tcp:rpc:nfs]
    [Coloring Rule Name: TCP]
    [Coloring Rule String: tcp]
Ethernet II, Src: Dell_54:c8:38 (00:13:72:54:c8:38), Dst: Clariion_06:41:5d (00:60:16:06:41:5d)
    Destination: Clariion_06:41:5d (00:60:16:06:41:5d)
        Address: Clariion_06:41:5d (00:60:16:06:41:5d)
        .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
        .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
    Source: Dell_54:c8:38 (00:13:72:54:c8:38)
        Address: Dell_54:c8:38 (00:13:72:54:c8:38)
        .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
        .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
    Type: IP (0x0800)
Internet Protocol, Src: 164.107.112.73 (164.107.112.73), Dst: 164.107.112.55 (164.107.112.55)
    Version: 4
    Header length: 20 bytes
    Differentiated Services Field: 0x00 (DSCP 0x00: Default; ECN: 0x00)
        0000 00.. = Differentiated Services Codepoint: Default (0x00)
        .... ..0. = ECN-Capable Transport (ECT): 0
        .... ...0 = ECN-CE: 0
    Total Length: 820
    Identification: 0xe37f (58239)
    Flags: 0x04 (Don't Fragment)
        0... = Reserved bit: Not set
        .1.. = Don't fragment: Set
        ..0. = More fragments: Not set
    Fragment offset: 0
    Time to live: 64
    Protocol: TCP (0x06)
    Header checksum: 0x2aed [correct]
        [Good: True]
        [Bad : False]
    Source: 164.107.112.73 (164.107.112.73)
    Destination: 164.107.112.55 (164.107.112.55)
Transmission Control Protocol, Src Port: 723 (723), Dst Port: nfs (2049), Seq: 147501, Ack: 9169, Len: 768
    Source port: 723 (723)
    Destination port: nfs (2049)
    Sequence number: 147501    (relative sequence number)
    [Next sequence number: 148269    (relative sequence number)]
    Acknowledgement number: 9169    (relative ack number)
    Header length: 32 bytes
    Flags: 0x18 (PSH, ACK)
        0... .... = Congestion Window Reduced (CWR): Not set
        .0.. .... = ECN-Echo: Not set
        ..0. .... = Urgent: Not set
        ...1 .... = Acknowledgment: Set
        .... 1... = Push: Set
        .... .0.. = Reset: Not set
        .... ..0. = Syn: Not set
        .... ...0 = Fin: Not set
    Window size: 16572
    Checksum: 0x2c7e [validation disabled]
        [Good Checksum: False]
        [Bad Checksum: False]
    Options: (12 bytes)
        NOP
        NOP
        Timestamps: TSval 2332632257, TSecr 32642440
    [SEQ/ACK analysis]
        [This is an ACK to the segment in frame: 525]
        [The RTT to ACK the segment was: 0.000016000 seconds]
    TCP segment data (768 bytes)
[Reassembled TCP Segments (3664 bytes): #523(1448), #524(1448), #526(768)]
    [Frame: 523, payload: 0-1447 (1448 bytes)]
    [Frame: 524, payload: 1448-2895 (1448 bytes)]
    [Frame: 526, payload: 2896-3663 (768 bytes)]
Remote Procedure Call, Type:Call XID:0x8ad2dd5e
    Fragment header: Last fragment, 3660 bytes
        1... .... .... .... .... .... .... .... = Last Fragment: Yes
        .000 0000 0000 0000 0000 1110 0100 1100 = Fragment Length: 3660
    XID: 0x8ad2dd5e (2329075038)
    Message Type: Call (0)
    RPC Version: 2
    Program: NFS (100003)
    Program Version: 3
    Procedure: WRITE (7)
    [The reply to this request is in frame 527]
    Credentials
        Flavor: AUTH_UNIX (1)
        Length: 76
        Stamp: 0x00a6aa46
        Machine Name: fl1.cse.ohio-state.edu
            length: 22
            contents: fl1.cse.ohio-state.edu
            fill bytes: opaque data
        UID: 7798
        GID: 10
        Auxiliary GIDs
            GID: 10
            GID: 11
            GID: 20
            GID: 275
            GID: 400
            GID: 558
            GID: 5727
            GID: 7176
    Verifier
        Flavor: AUTH_NULL (0)
        Length: 0
Network File System, WRITE Call FH:0x062d9726 Offset:0 Len:3488 UNSTABLE
    [Program Version: 3]
    [V3 Procedure: WRITE (7)]
    file
        length: 32
        [hash: 0x062d9726]
        decode type as: unknown
        filehandle: 4F00000001000600CE3E7E009660414F4F00000001000600...
    offset: 0
    count: 3488
    Stable: UNSTABLE (0)
    Data: <DATA>
        length: 3488
        contents: <DATA>

Then we have the reply in the very next frame:

No.     Time        Source                Destination           Protocol Info
    527 4.037359    164.107.112.55        164.107.112.73        NFS      V3 WRITE Reply (Call In 526) Error:NFS3ERR_IO

Frame 527 (130 bytes on wire, 130 bytes captured)
    Arrival Time: Dec  1, 2010 13:17:02.416768000
    [Time delta from previous captured frame: 0.000983000 seconds]
    [Time delta from previous displayed frame: 0.000983000 seconds]
    [Time since reference or first frame: 4.037359000 seconds]
    Frame Number: 527
    Frame Length: 130 bytes
    Capture Length: 130 bytes
    [Frame is marked: False]
    [Protocols in frame: eth:ip:tcp:rpc]
    [Coloring Rule Name: TCP]
    [Coloring Rule String: tcp]
Ethernet II, Src: Clariion_06:41:5d (00:60:16:06:41:5d), Dst: Dell_54:c8:38 (00:13:72:54:c8:38)
    Destination: Dell_54:c8:38 (00:13:72:54:c8:38)
        Address: Dell_54:c8:38 (00:13:72:54:c8:38)
        .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
        .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
    Source: Clariion_06:41:5d (00:60:16:06:41:5d)
        Address: Clariion_06:41:5d (00:60:16:06:41:5d)
        .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
        .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
    Type: IP (0x0800)
Internet Protocol, Src: 164.107.112.55 (164.107.112.55), Dst: 164.107.112.73 (164.107.112.73)
    Version: 4
    Header length: 20 bytes
    Differentiated Services Field: 0x00 (DSCP 0x00: Default; ECN: 0x00)
        0000 00.. = Differentiated Services Codepoint: Default (0x00)
        .... ..0. = ECN-Capable Transport (ECT): 0
        .... ...0 = ECN-CE: 0
    Total Length: 116
    Identification: 0xa850 (43088)
    Flags: 0x00
        0... = Reserved bit: Not set
        .0.. = Don't fragment: Not set
        ..0. = More fragments: Not set
    Fragment offset: 0
    Time to live: 64
    Protocol: TCP (0x06)
    Header checksum: 0xa8dc [correct]
        [Good: True]
        [Bad : False]
    Source: 164.107.112.55 (164.107.112.55)
    Destination: 164.107.112.73 (164.107.112.73)
Transmission Control Protocol, Src Port: nfs (2049), Dst Port: 723 (723), Seq: 9169, Ack: 148269, Len: 64
    Source port: nfs (2049)
    Destination port: 723 (723)
    Sequence number: 9169    (relative sequence number)
    [Next sequence number: 9233    (relative sequence number)]
    Acknowledgement number: 148269    (relative ack number)
    Header length: 32 bytes
    Flags: 0x18 (PSH, ACK)
        0... .... = Congestion Window Reduced (CWR): Not set
        .0.. .... = ECN-Echo: Not set
        ..0. .... = Urgent: Not set
        ...1 .... = Acknowledgment: Set
        .... 1... = Push: Set
        .... .0.. = Reset: Not set
        .... ..0. = Syn: Not set
        .... ...0 = Fin: Not set
    Window size: 12288
    Checksum: 0x26db [validation disabled]
        [Good Checksum: False]
        [Bad Checksum: False]
    Options: (12 bytes)
        NOP
        NOP
        Timestamps: TSval 32642440, TSecr 2332632257
    [SEQ/ACK analysis]
        [This is an ACK to the segment in frame: 526]
        [The RTT to ACK the segment was: 0.000983000 seconds]
Remote Procedure Call, Type:Reply XID:0x8ad2dd5e
    Fragment header: Last fragment, 60 bytes
        1... .... .... .... .... .... .... .... = Last Fragment: Yes
        .000 0000 0000 0000 0000 0000 0011 1100 = Fragment Length: 60
    XID: 0x8ad2dd5e (2329075038)
    Message Type: Reply (1)
    [Program: NFS (100003)]
    [Program Version: 3]
    [Procedure: WRITE (7)]
    Reply State: accepted (0)
    [This is a reply to a request in frame 526]
    [Time from request: 0.000983000 seconds]
    Verifier
        Flavor: AUTH_NULL (0)
        Length: 0
    Accept State: RPC executed successfully (0)
Network File System, WRITE Reply  Error:NFS3ERR_IO
    [Program Version: 3]
    [V3 Procedure: WRITE (7)]
    Status: NFS3ERR_IO (5)
    file_wcc
        before
            attributes_follow: value follows (1)
            attributes
                size: 0
                mtime: Dec  1, 2010 13:17:02.000782000
                    seconds: 1291227422
                    nano seconds: 782000
                ctime: Dec  1, 2010 13:17:02.000782000
                    seconds: 1291227422
                    nano seconds: 782000
        after
            attributes_follow: no value (0)

This is significant to me, not only because it shows the problem directly, but because it tells me the write errors I’ve noticed were not due to a delay of some kind that caused an RPC timeout. This was an NFS3ERR_IO error returned. That’s good to know, as well as the fact it came back immediately as such.

My concern about an RPC timeout came from realizing that all of our NFS exports are mounted using the “soft” option. A timeout that takes too long can create the same I/O errors. That’s why you use the default “hard” option for anything that’s important. We have an EMC serving SAN and NAS storage. Our environment is fairly large for a college department, but despite the fact our EMC is over 5 years old, it performs like magic (until recently anyway). Even though we are doing soft NFS mounts, we’ve never had this problem. This was my first guess however, because it made sense. Tracking down the cause is the problem. One of my complaints about devices like our EMC is that the server portion is essentially a black box. I can’t get in there and see what’s going on beyond what I can view in logs through the control station, etc. I am not an EMC expert, but that’s my understanding at this point. If this were a Unix NFS server, I’d have been logged into it poking around.
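For anyone wanting to audit their own clients, here is a small Ruby sketch that picks out soft-mounted NFS file systems from text in the Linux /proc/mounts format (device, mount point, type, options, dump, pass). The host names and paths in the sample are invented for illustration and are not our real exports.

```ruby
NFS_TYPES = %w[nfs nfs4].freeze

# Return the NFS entries whose option list contains "soft".
def soft_nfs_mounts(mounts_text)
  mounts_text.each_line.map do |line|
    device, mount_point, fstype, options = line.split
    next unless NFS_TYPES.include?(fstype)
    next unless options.to_s.split(",").include?("soft")
    { device: device, mount_point: mount_point }
  end.compact
end

# Hypothetical sample data in /proc/mounts format.
SAMPLE_MOUNTS = <<~MOUNTS
  /dev/sda1 / ext3 rw 0 0
  nas1:/export/home /home nfs rw,soft,timeo=600,retrans=2 0 0
  nas1:/export/proj /proj nfs rw,hard,intr 0 0
MOUNTS

soft_nfs_mounts(SAMPLE_MOUNTS).each do |mount|
  puts "soft mount: #{mount[:device]} on #{mount[:mount_point]}"
end
```

On a real Linux client you would pass in File.read("/proc/mounts") instead of the sample text.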

These two frames make it pretty clear that the write errors are actually the result of the RPC reply from the NFS server directly. There is no timeout issue. The soft vs. hard mounting shouldn’t make a difference. Something else is very wrong here. At least these two frames cut down the potential problems somewhat. After looking at the logs of the primary Data Mover, it was clear there was some issue with the cache. It seems it was not able to update its cache, and perhaps because these were UNSTABLE writes that don’t have to go to non-volatile storage, it simply could not write them to its cache. I would expect it to disable cache or simply try to commit the data to disk, thus choosing to suffer a performance hit instead of potentially causing data issues for the client. This is all theory however. I don’t even have that error message anymore because rebooting the Data Mover filled the log again.

As previously mentioned, the solution for now was to reboot the Data Mover. This solved the immediate problem. On reboot we did see some other ominous errors in its log:

Error	12/1/10 14:40	SYSTEM	envmon: 10 recoverable Single Bit Errors occured
Error	12/1/10 14:40	SYSTEM	envmon: MESR0 0x7ccaa044, MESR1 0x7ccaa044

At least they were recoverable, but I am wondering if we’re just waiting for a hardware failure. We are going to contact our current support (which is not EMC at this time) and see if they can shed some light. I assume that, whatever this is, it is local to the primary Data Mover. We do have the option of doing a failover to the other Data Mover if this one is about to have hardware problems.

After this reboot, the number of CIFS I/O operations has decreased significantly. That might indicate that this was causing problems with CIFS too, but I am not familiar with how CIFS deals with errors similar to what NFS experienced (and I assume the errors would be similar). This might have been pushing the I/O operations higher than normal, thus exacerbating the overall problem. This is all conjecture now. At this point I need to keep an eye on the system and think about how we are to respond if this starts happening again. Right now, we are all good to go.

This could also just be some software bug. We find our Data Movers like to crash themselves periodically, though it is rare. We have taken to scheduling a reboot of the system every couple of quarters in order to avoid that. I am unsure which problem scenario I would prefer more at the moment :-)

Weather Channel iPad App Epic Fail

Jul 23rd, 2010 | Filed under General

The iPad is wonderfultastic, but it is missing a good weather application. I found The Weather Channel Max for iPad application right away, and it was also free (not as in GNU, yes I know the difference). I gave it a try. I really liked it. The radar was one of the most important features to me, but it was a little buggy at first. They resolved that pretty quickly. I loved this weather application. I deleted it Tuesday.

Why? Great question that I can’t help but answer. First of all, the application is free. I expected there to be some advertising, especially since it was clear that it was “brought to me by Toyota” right from the start. Oh, they don’t let you forget it either. Toyota. Toyota. Toyota! I’ve used free software before that was supported through reasonable advertising and I survived. The key word here is “reasonable”. The Weather Channel Max for iPad application is light years beyond reasonable.

It started right away when I switched to the local forecast view from the radar. Occasionally I would get a full screen ad. The screen would fade to black and a video ad would play. Luckily it was only slightly painful to cancel the video. First of all, that’s not reasonable advertising. It’s completely intrusive to the overall experience, but it only happened about every second time I switched. There’s also an ad that plays in the local forecast view anyway, sans sound. Is that not enough? That’s what I would expect in this case. They aren’t having it though. They’ve got to go the extra 10 miles. Since I loved the radar the most, it didn’t affect me much. However, after some updates it started happening every single time I switched away from the radar. At first I thought the application was broken and I closed it even though I had experienced this ridiculous advertising method before. I’ve been a systems administrator and software developer for over a decade, and I’ve used some incredibly terrible UIs before. I am not a computing novice by any stretch of the imagination, so when this is confusing to me, it might be a general problem. Just saying…

I would like to know who thought that advertising this intrusive was a good idea? It’s totally insane. It ruins the application. I really want to know if it was The Weather Channel or Toyota. I want to know who to dislike more. Their advertising experiment has totally backfired on me. Now it seems that I don’t care for either one when once I did. I deleted this application and actually paid for one that’s not as good. That fact alone is why this application is a total failure. Total. Epic. Advertising. Fail. Period. Let’s not even consider what it takes to make someone actually write about the fact.

Let’s take a look at the application ratings at the time I left my review after deleting it:

*****		544
 ****		367
  ***		738
   **		1157
    *		3375

Let’s break that down:

  • Good or Great Count: 911
  • Indifferent Count: 738
  • Hate it or Don’t Like it Count: 4532

Interesting, but I’ll break it down to percentages too:

  • 14.7% of users have a favorable opinion.
  • 11.9% of users don’t seem to care.
  • 73.3% of users hate the application.
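
For anyone who wants to check the arithmetic, here is a quick Ruby sketch that groups the star counts the same way and rounds each share to one decimal place. The group names and the function are my own labels for this example.

```ruby
# Star rating => number of reviews, copied from the list above.
RATINGS = { 5 => 544, 4 => 367, 3 => 738, 2 => 1157, 1 => 3375 }.freeze

# Group the ratings into three buckets and compute each bucket's
# count and share of the total, rounded to one decimal place.
def rating_summary(ratings)
  total = ratings.values.sum
  groups = {
    favorable:   ratings[5] + ratings[4],
    indifferent: ratings[3],
    unfavorable: ratings[2] + ratings[1],
  }
  groups.transform_values do |count|
    { count: count, percent: (100.0 * count / total).round(1) }
  end
end

rating_summary(RATINGS).each do |group, stats|
  puts format("%-12s %5d  %5.1f%%", group, stats[:count], stats[:percent])
end
```

Because each share is rounded on its own, the three percentages do not have to add up to exactly 100.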

By any stretch of the imagination, those ratings suck. That’s just for the current version. I can only see the average for all versions, but it’s still shown as two stars. A number of people complained about the advertising. A lot of people seem to have stability problems, but I never did. Apparently there are a lot of reasons to dislike this application. Dislike is the kind version of me saying “hate this application”. Honestly, some of the positive reviews sound fake to me as well, but that might just be due to my view of the application. I don’t understand how anyone can like it.

This application deserves five stars as an example of how not to incorporate advertising into your free application. Especially when it actually drives people away and some of them actually buy a lesser quality paid application to escape the hell of your application’s UI advertising “feature”. I’d probably buy an ad free version of this application, but it was so annoying that I don’t know if I would really go that far. Did I mention how this was an epic fail?

It’s so annoying that it’s funny. I can’t believe someone thought this was a good idea. I would like to meet that person just to know how this view is even possible.

Oracle Sequence and DataMapper

Jul 14th, 2010 | Filed under Ruby, Ruby on Rails

I developed a Rails application at work and I needed to export that data for import into another system. The other system is running Oracle 11g. I am using MySQL. I read extensively on SQL Loader and provided the data that way, but I also wanted to provide another loader that would use sequences correctly, etc. instead of just assuming a reserved set of ID values for imported data. I also wanted this to work on any reasonable platform.

I need to use JRuby for another project already because I need to interface directly with some Java code, so I decided to use JRuby here as well so that it would just work on multiple platforms. I also decided to go with DataMapper as it seemed appropriate for this job. I’ve not used DataMapper before, and I had one problem in particular with using a sequence for a primary key. There is a decent amount of documentation for DataMapper, but there could be more :-) I saw some posts about doing something like this:

property :id,                     Serial,   :sequence => 'ECA_REQUEST_S'

I kept getting an ArgumentError, however. I fixed that by putting the following in my code:


# This is required so that we can specify a :sequence argument for the Serial
# primary key on the ECA_REQUEST_T table. Normally this option would cause
# an ArgumentError error to be raised from line 826 in the following file:
#
# Note: prepend <Ruby Dir>/lib/ruby/gems/<version>/gems to all paths here.
#
# dm-core-1.0.0/lib/dm-core/property.rb
#
# It is clear in the following file:
#
# dm-oracle-adapter-1.0.0/lib/dm-oracle-adapter/adapter.rb
#
# that :sequence is definitely used by the Oracle adapter (plus, it actually
# does work - imagine that :-) I am not sure how else this is done in
# DataMapper. I'd love to know, because this seems weird, but it definitely
# works right.
DataMapper::Property.accept_options(:sequence)

The comments pretty much say it all. I don’t know if there is a better or more correct way to allow the :sequence to be specified like I needed, but the code above worked fine. Nothing else I did seemed to work, but I would like to know if I am just missing something.

This was quite an adventure. My first attempt went from MySQL to YAML to SQLite and then finally into Oracle. I changed the code and now the process goes from MySQL to SQLite and then finally into Oracle. I have to store the data in some sort of secondary storage in this case, so I can’t just go between databases directly. I also have to convert between the two schemas. It turns out that this was a great learning experience for using DataMapper, but it also helped out because I could compare the loads from both methods. I discovered a minor character set issue that I was able to deal with because of this work.

I am impressed with DataMapper, but I’ll still use ActiveRecord in my Rails projects for now. There could be some more documentation about dealing with legacy databases and the specific issue I ran into, but when isn’t that true? Don’t forget to copy your Oracle JDBC driver to the JRuby lib subdirectory as well. That hung me up for a few minutes, and once again I learned that sometimes you should just look at the source code the error message reports instead of jumping directly to Google. This was a ton of work, it probably won’t be used, but it was fun and I learned something. That alone is awesome. I love this stuff.

Are Web Fonts Worth the Trouble?

Jun 14th, 2010 | Filed under General

I am on the fence on this one. I had some font issues with a PDF I am generating with the pdf-writer gem for a Rails project. It’s not all that serious. The database stores UTF-8 text, and users cut and paste from Microsoft Word (or whatever). Those characters are not rendered correctly in the PDF file version. I really want to try and store whatever they put into the database properly, but I have to translate those characters to ISO-8859-1 to remove the characters that will not render correctly. On my Mac the characters in question are usually translated to something close enough, but on Linux the characters are simply removed. I’ve seen a number of people ask about this, and that’s where I got the solution from (basically redefining the PDF module’s Writer class text() instance method, among other steps to fully “solve” the problem). Don’t ask me how to solve this. What I did was given as the only solution I could find. Maybe Prawn will solve this someday, but it is not quite ready for my use.
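To give an idea of the kind of translation involved, here is a rough Ruby sketch. It is not the pdf-writer patch I actually used. It assumes Ruby 1.9 or later (for String#encode), the substitution table is just my guess at the usual Microsoft Word punctuation, and anything else that has no ISO-8859-1 equivalent becomes a question mark.

```ruby
# Typographic characters that commonly arrive via cut and paste,
# mapped to plain ASCII stand-ins (an illustrative list, not exhaustive).
SUBSTITUTIONS = {
  "\u2018" => "'", "\u2019" => "'",   # curly single quotes
  "\u201C" => '"', "\u201D" => '"',   # curly double quotes
  "\u2013" => "-", "\u2014" => "--",  # en and em dashes
  "\u2026" => "...",                  # ellipsis
}.freeze
PATTERN = Regexp.union(SUBSTITUTIONS.keys)

# Convert UTF-8 text to ISO-8859-1, substituting known punctuation first
# and replacing whatever still cannot be represented with "?".
def to_latin1(text)
  text.gsub(PATTERN, SUBSTITUTIONS)
      .encode("ISO-8859-1", invalid: :replace, undef: :replace, replace: "?")
end

latin1 = to_latin1("\u201Ccaf\u00E9\u201D \u2013 done\u2026")
puts latin1.encode("UTF-8")
```

The database copy stays untouched as UTF-8; only the text handed to the PDF generator gets converted.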

This got me on some font exploration in an attempt to embed a font. I had also read a number of articles in my RSS feeds about fonts lately – in particular web fonts. I even bought some font tools to try some of this out. This got me nowhere however. I can’t just embed the fonts I generally have on my Mac without surely violating some license, and even though I just wanted to try to see if this was a viable solution – it turns out that it really isn’t due to what type of font I seem to be able to embed anyway. Even if I had a license to embed, I can’t get the resulting PDF into a state that meets the license requirements (for example, on my Mac Preview always asks me if I want to install the font from the file generated by pdf-writer). Therefore I just gave up on this path. And to be clear, I am strongly against violating any license whatsoever, so I will never go down that route.

After this work I became interested in web fonts. As far as I am concerned, the only real solution to the web font question is buying fonts from Fontspring. I don’t want my fonts hosted elsewhere because I don’t generally want to trust my site to some third party (you can probably guess what my general feeling on that oh so well defined concept of cloud computing is – let’s just say I don’t necessarily think it is all puppies and kittens – but I don’t discount it at all regardless of this feeling). I don’t use things like GitHub either, preferring to host my own Git repositories (not that anyone would miss my code at the moment probably, but fairly soon there will be something really useful there – at which point I’ll probably have to also host code elsewhere, but let me dream about my self-hosting utopia for now please :-)

Back to the story. Fontspring is fine. Most of the font licenses are reasonable. I am not a typographer, but some of those licenses are insane IMO. I had no idea fonts were this much of a pain. Seriously? I guess I can see why to some degree. Fonts are truly magical and typography is important, but using fonts on the web needs to change eventually I think. It’s a little crazy. I used a couple of fonts I bought from Fontspring on this site’s index page. They are nothing particularly special. My main goal was just to see how to do it, and then see if it worked on all the major browsers I have to test with (just about everything of any relevance). This all worked fine, but it took forever for me to figure out how to remove any unused glyphs due to this licensing restriction:

2. The Web Font must be subset to include only the glyphs necessary for displaying the web site.

Sigh. Seriously? I can’t just use the web font file that was provided? I actually have to remove all the glyphs not used on my site? That’s a nightmare. I can hear the typographers now, “We don’t care what you think. This is not overly difficult. I mean, we can’t have someone stealing a font you paid less than $20 for!” This is not easy IMO. Not at all. I’d love someone to tell me it is. The only solution I’ve found is to use Font Squirrel. I made sure I checked what seemed like the right options based on the other license restrictions, even though the source (the web font only version, not the desktop one) probably doesn’t have some of the possible features that would be questionable anyway. From my understanding, this is all fine.

Is this process worth it? Right now I have to say no. If I had some insanely great font (and expensive – because you know it will be even more insane to purchase) it might be. I am just starting to explore web fonts because I do believe that typography is important. Regardless, the process for dealing with this is a huge pain (and no, I don’t want to use some hosting service that charges even more insane prices based on various criteria). Maybe I just don’t get it… oh, but I think I really do in fact.

I am going to explore some free fonts as well. Maybe there are some good ones that will make me feel better. Obviously I am not a typographer, so I don’t want to come off as complaining about getting something for free. I am more than willing to pay reasonable prices for fonts if I don’t have to rely on someone else to host them. Reasonable to me is in the hundreds of dollars range for decent fonts if I can do what I generally want. I don’t think that’s unreasonable. I’d be willing to remove glyphs if we could do something more like “a reasonable subset” instead of “any glyph unnecessary to render your site” (and to note, I take that exactly as how it is stated, plus can we get a really good, easy to use free tool to do this – maybe that’s going too far and soon I’ll learn how it really is a lot easier than I think… I just doubt it). I am just an average potential customer, perhaps way less than average in the fonts world, and this seems to be somewhat completely lame to at least a degree.

I do like my less than $40 font collection (I bought two families). That’s not saying much in the world of fonts. In my limited investigation, it seems many fonts, if not most, cost an arm, leg, and then head – plus selling whatever is left of your soul for the ability to use some of the glyphs with 476 restrictions. Heh. Yeah, I just like to complain I guess.

Releasing [NSMachPort port]

May 23rd, 2010 | Filed under Cocoa, Mac OS X

When it comes to software development using Objective-C frameworks, I like rules. Rules tell me what to do because… they are rules. That’s what rules do. Let’s review some Apple rules on memory management in a reference counted environment (Cocoa for example). First we have the fundamental rule:

  • You take ownership of an object if you create it using a method whose name begins with “alloc” or “new” or contains “copy” (for example, alloc, newObject, or mutableCopy), or if you send it a retain message. You are responsible for relinquishing ownership of objects you own using release or autorelease. Any other time you receive an object, you must not release it.

Then you have the following rules derived from the fundamental rule:

  • As a corollary of the fundamental rule, if you need to store a received object as a property in an instance variable, you must retain or copy it. (This is not true for weak references, described at “Weak References to Objects,” but these are typically rare.)
  • A received object is normally guaranteed to remain valid within the method it was received in (exceptions include multithreaded applications and some Distributed Objects situations, although you must also take care if you modify an object from which you received another object). That method may also safely return the object to its invoker.
    Use retain in combination with release or autorelease when needed to prevent an object from being invalidated as a normal side-effect of a message (see “Validity of Shared Objects”).
  • autorelease just means “send a release message later” (for some definition of later—see “Autorelease Pools”).

Makes sense right? This is what I live my life by. I favor memory management over garbage collection because I want to make sure I understand it (and the iPhone OS doesn’t do garbage collection, so I am “thinking ahead” you could say).
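To make the rules concrete, here is a minimal sketch of each one in action, assuming manual reference counting (no garbage collection). It is not from the Apple documentation, and OwnershipDemo is a hypothetical helper name used only for illustration. It counts the release messages the code is responsible for sending.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper (not from Apple's documentation). Walks through the
// ownership rules and returns how many objects this code had to release.
// Assumes manual reference counting.
static NSUInteger OwnershipDemo(void)
{
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
    NSUInteger released = 0;

    // Created with "alloc": we own it and must release it.
    NSMutableString *owned = [[NSMutableString alloc] initWithString:@"owned"];

    // Created with "copy": we own the copy as well.
    NSString *copied = [owned copy];

    // Convenience constructor: we do NOT own this, so we must not release it
    // unless we retain it first.
    NSString *borrowed = [NSString stringWithFormat:@"%@-%d", owned, 1];

    // We want borrowed to outlive the pool, so we take ownership with retain.
    [borrowed retain];

    [owned release];
    released++;
    [copied release];
    released++;

    // Releasing the pool sends the deferred release messages. The borrowed
    // string survives only because of the retain above.
    [pool release];

    // Still valid here: we own it until we release it.
    (void)[borrowed length];
    [borrowed release];
    released++;

    return released;
}
```

Three objects were owned (one alloc, one copy, one retain), so three release messages are sent. Without the retain, nothing would be owed for the convenience-constructor string, and it would not be safe to use after the pool is released.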

In my trip down the road of ADC OCD, which entails reading the ADC documentation for a second time because I want to make sure I don’t end up looking like a fool when weaving the artistry that is programming, I came across the following in the “Threading Programming Guide” with respect to using NSMachPort objects to communicate between threads:

- (void)launchThread
{
    NSPort* myPort = [NSMachPort port];
    if (myPort)
    {
        // This class handles incoming port messages.
        [myPort setDelegate:self];

        // Install the port as an input source on the current run loop.
        [[NSRunLoop currentRunLoop] addPort:myPort forMode:NSDefaultRunLoopMode];

        // Detach the thread. Let the worker release the port.
        [NSThread detachNewThreadSelector:@selector(LaunchThreadWithPort:)
               toTarget:[MyWorkerClass class] withObject:myPort];
    }
}

This all seems pretty straightforward, but even using weak deduction powers one should immediately be concerned about the comment “Detach the thread. Let the worker release the port.” What? We’ve covered the reference counted memory management rules already, but let’s even pull this out from the same guide on memory management:

Many classes provide methods of the form +className… that you can use to obtain a new instance of the class. Often referred to as “convenience constructors”, these methods create a new instance of the class, initialize it, and return it for you to use. You do not own objects returned from convenience constructors, or from other accessor methods.

Note that the port method of the NSMachPort class is really inherited from the superclass NSPort and in general should fit that rule. In these cases, as evidenced by the above, one does not own the object returned and is not responsible for releasing it. Yet later the dispatched thread does release it:

+(void)LaunchThreadWithPort:(id)inData
{
    NSAutoreleasePool*  pool = [[NSAutoreleasePool alloc] init];

    // Set up the connection between this thread and the main thread.
    NSPort* distantPort = (NSPort*)inData;

    MyWorkerClass*  workerObj = [[self alloc] init];
    [workerObj sendCheckinMessage:distantPort];
    [distantPort release];

    // Let the run loop process things.
    do
    {
        [[NSRunLoop currentRunLoop] runMode:NSDefaultRunLoopMode
                            beforeDate:[NSDate distantFuture]];
    }
    while (![workerObj shouldExit]);

    [workerObj release];
    [pool release];
}

Note that workerObj does retain it however. Confused? I was. This doesn’t follow the rules. It might work, but is it really necessary? Is it that the rules don’t apply, or does that code just happen to work anyway even though it is not quite right? This does not bode well for my “reading the documentation for complete understanding of reference counted memory management in the Objective-C universe on Mac OS X” OCD issue. Of course, according to the rules, one should retain this thing if they need it. However, I feel there’s an extra release in there and it’s just sticking around due to the implementation details hidden in NSPort.

I decided to write a non-AppKit program to test this out. First, one needs to note the following about getting retain counts:

Important: Typically there should be no reason to explicitly ask an object what its retain count is (see retainCount). The result is often misleading, as you may be unaware of what framework objects have retained an object in which you are interested. In debugging memory management issues, you should be concerned only with ensuring that your code adheres to the ownership rules.

Well, I suppose there should be no reason unless you think the documentation example is wrong I guess. In fact, this example demonstrates the above point as well. Here is the test code:

#import <Foundation/Foundation.h>

int
main(void)
{
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

    // This should be autoreleased according to "the rules".
    NSPort* myPort = [NSMachPort port];

    // This should work the same way:
    NSString *myString = [NSString stringWithUTF8String:"foobar"];

    // This requires a release because we own it.
    NSString *allocString = [[NSString alloc] initWithUTF8String:"barfoo"];

    // According to the rules, the retain count should be 1 because it is
    // autoreleased, but it's actually 2. This indicates to me that it is
    // being retained (or was created with allocWithZone as the documentation
    // seems to suggest). This is actually all right as long as, by following
    // the rules, an autorelease pool drain (or release) would free it. Note
    // that this is the reason that it is hard to ask for a retain count and
    // know what's really going on (yet I am doing it, sue me).
    NSLog(@"myPort retain count = %u\n", [myPort retainCount]);
    // But wait, we are supposed to invalidate it before we are done though.
    // Let's do that.
    [myPort invalidate];
    // Now its retain count is 1. Do we need to release this or will it be
    // autoreleased? The example code I've seen shows it being explicitly
    // released, but according to the rules, this should be an autorelease
    // situation.
    NSLog(@"myPort retain count = %u\n", [myPort retainCount]);

    // myString does the right thing.
    NSLog(@"myString retain count = %u\n", [myString retainCount]);

    // allocString also has a retain count of 1, doing the right thing again.
    NSLog(@"allocString retain count = %u\n", [allocString retainCount]);

    // This will release the autoreleases. Note that one should really use
    // [pool drain], but without garbage collection (or the possibility
    // thereof), this is the same thing.
    [pool release];

    // Now any autorelease calls should be dealt with and it should be
    // gone. I am surprised this doesn't cause a segmentation fault, and
    // the retain count is actually nonsense. If you try to call this again,
    // it will cause a segmentation fault. One should not rely on retainCount
    // making too much sense (I mean, see the behavior here), but it is
    // somewhat useful.
    NSLog(@"myPort retain count = %u\n", [myPort retainCount]);

    // But myString is gone. Calling this will lead to a segmentation fault!
    //NSLog(@"myString retain count = %u\n", [myString retainCount]);

    // allocString is still here though because we've not released it yet.
    NSLog(@"allocString retain count = %u\n", [allocString retainCount]);

    // Let's be sure to follow the rules, even though this process will clean
    // itself up.
    [allocString release];

    // Oh yes, please exit appropriately people :-) Not sure I would consider
    // this all a success though.
    exit(EXIT_SUCCESS);
}

I’ve pretty much explained it in the comments, but let’s see what happens when it runs:

2010-05-23 17:35:01.103 main[965:903] myPort retain count = 2
2010-05-23 17:35:01.105 main[965:903] myPort retain count = 1
2010-05-23 17:35:01.106 main[965:903] myString retain count = 1
2010-05-23 17:35:01.106 main[965:903] allocString retain count = 1
2010-05-23 17:35:01.107 main[965:903] myPort retain count = 4294967295
2010-05-23 17:35:01.107 main[965:903] allocString retain count = 1

Ah, so this really seems to confirm that one does not need to release the NSMachPort object. One is supposed to invalidate it when it’s finished, which drops the retain count from 2 to 1, and then releasing the autorelease pool seems to do the rest (though the retain count is nonsense after that, I am surprised it did not segmentation fault). This seems to confirm the rules. So, I believe the original threading example code is just wrong. I’d love for someone to clean that up or double check (and explain things if I am actually wrong).

What’s also interesting about the results, if we assume that this is working according to the rules (which I claim it is), is that the retain count is originally 2 after creation using the port class method. I believe the object is actually retained somewhere in that method. This is fine: as long as you do what you are supposed to do (invalidate it), the autorelease mechanism seems to really take care of cleanup, so you should not need to release it yourself. Doing so could be problematic, I believe. It is in the autorelease pool, and then they release it? Of course, the example code does not call the invalidate method either, so maybe that’s why it does not blow up.

The NSPort documentation states:

When you are finished using a port object, you must explicitly invalidate the port object prior to sending it a release message. Similarly, if your application uses garbage collection, you must invalidate the port object before removing any strong references to it. If you do not invalidate the port, the resulting port object may linger and create a memory leak. To invalidate the port object, invoke its invalidate method.

The release they mention should only apply (invalidate is always needed) if you created the port with a method covered by the fundamental rule of memory management in reference counted environments. The port method (which is not covered by the fundamental rule unless you retain its return value) is documented as:

port
Creates and returns a new NSPort object capable of both sending and receiving messages.

+ (NSPort *)port

Return Value
A new NSPort object capable of both sending and receiving messages.

Availability
Available in Mac OS X v10.0 and later.
See Also
+ allocWithZone:
Related Sample Code
SimpleThreads
TrivialThreads
Declared In
NSPort.h

There is nothing above that makes me believe the rules have changed. I don’t believe that the rules in this case are the same as trying to get App Store approval after all (who knows what magic behind the scenes rules are going on there at any given time – don’t get me wrong, I love the App Store as a customer though :-) ).

It is interesting that I only found one other person who asked about this. I found no answers to that question. I totally obsess over this type of thing because it is important. Writing a threaded implementation is hard enough, but to add any possible confusion about the rules does not help at all. If I were to implement communication between threads using an NSMachPort, I would try to do it the way I know should be correct according to the rules. Now the only thing I know is to do so, but be on the lookout for breakage because it might not work like the rules say. Still, I argue that it really does :-)
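Here is a minimal sketch of what a rules-conforming version might look like, assuming manual reference counting. RulesWorker, initWithRemotePort:, launchThreadWithPort:, and RulesConformingPortDemo are hypothetical names for illustration, not anything from the Threading Programming Guide. The code that obtains the port from the convenience constructor never releases it, the worker retains it only because it stores it, and the port is invalidated when it is no longer needed.

```objective-c
#import <Foundation/Foundation.h>

// RulesWorker is a hypothetical class used only for this sketch.
@interface RulesWorker : NSObject
{
    NSPort *remotePort;  // stored in an instance variable, so it is retained
}
- (id)initWithRemotePort:(NSPort *)port;
+ (void)launchThreadWithPort:(id)inData;
@end

@implementation RulesWorker

- (id)initWithRemotePort:(NSPort *)port
{
    self = [super init];
    if (self)
    {
        // Corollary of the fundamental rule: retain what you store.
        remotePort = [port retain];
    }
    return self;
}

- (void)dealloc
{
    // Balance the retain taken in the initializer.
    [remotePort release];
    [super dealloc];
}

// Thread entry point. The port was received here, not created here, so this
// method does not release it.
+ (void)launchThreadWithPort:(id)inData
{
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
    RulesWorker *worker = [[self alloc] initWithRemotePort:(NSPort *)inData];

    // ... check in with the main thread and run the run loop here ...

    [worker release];
    [pool release];
}

@end

// Exercises the ownership side only (no thread is detached), so the result
// is deterministic. Returns YES if the port was valid and then invalidated.
static BOOL RulesConformingPortDemo(void)
{
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

    // Convenience constructor: not owned, so never released by this code.
    NSPort *port = [NSMachPort port];
    if (port == nil)
    {
        [pool release];
        return NO;
    }

    RulesWorker *worker = [[RulesWorker alloc] initWithRemotePort:port];

    BOOL wasValid = [port isValid];
    [port invalidate];                  // required when finished with a port
    BOOL nowInvalid = ![port isValid];

    [worker release];                   // worker drops its own retain in dealloc
    [pool release];                     // the pool sends the deferred release
    return wasValid && nowInvalid;
}
```

When actually detaching a thread, the NSThread documentation for detachNewThreadSelector:toTarget:withObject: says the target and argument are retained while the detached thread executes, which (if that holds) is another reason the worker should not need the extra release on the port it was handed.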

Man, someone straighten out that documentation! Of course, I’ll probably use an operation or GCD instead, but it’s still good to know the rules are rules one can live by. I forgot to mention this code example that’s further down in the “Threading Programming Guide” about using an NSMessagePort object:

NSPort* localPort = [[[NSMessagePort alloc] init] retain];

// Configure the object and add it to the current run loop.
[localPort setDelegate:self];
[[NSRunLoop currentRunLoop] addPort:localPort forMode:NSDefaultRunLoopMode];

// Register the port using a specific name. The name must be unique.
NSString* localPortName = [NSString stringWithFormat:@"MyPortName"];
[[NSMessagePortNameServer sharedInstance] registerPort:localPort
                     name:localPortName];

A retain after an alloc? At this point I have to assume the documentation is simply crazy: [sanity release];

iPad Revolution

May 15th, 2010 | Filed under Hardware

I finally broke down and bought an iPad. I wasn’t going to buy one. I thought to myself, “You always have either your 17-inch MacBook Pro or your 15-inch MacBook Pro work laptop anyway… what do you want with an over-sized iPod?” Well, it’s not that simple of course! For a long time I did want an iPod, but what I really wanted amounted to an iPad actually. I tried an iPad out in a store. Ten minutes later I knew that I had to buy one. It really was that quick. I got lucky in that I could not find one in any store right away. I actually had to order online and wait a week or so to get the device. Why was that lucky? Due to not being able to get one that very day, I was able to resist the urge to buy one on the spot long enough to actually order a 3G model because I had no other choice. Again, what I really wanted was access to the Internet on a device like this from virtually anywhere (well, anywhere on AT&T’s network anyway). This has all worked out in the end.

Is this device revolutionary? Yes. Others have written much better essays to this effect. The user interface is sheer genius IMO. Let’s take the simple example of viewing a web page or even a PDF file viewed in the web browser. You can simply double tap on a div or a column of a multi-column PDF file and it zooms in perfectly. You need to move the zoomed view to the next column? Oh yeah, that’s no problem – it knows that you really want to go in only one direction, so even if you mess up a little after starting, it basically “locks” the movement in that direction. Even the keyboard interface is smart enough to know that my switching to numbers in order to add an apostrophe means that I immediately intend to go back to the letter interface to type more. These seem like small issues, but they make using the device a wonderful experience. The auto-correction works flawlessly for me as well. I am typing this article on my iPad. So far the only thing I wish the keyboard had is arrow keys, but I already have a wireless keyboard if I need it.

The apps I have are brilliant as well. The user interface to Apple Mail just works for me. NetNewsWire’s user interface is perfect. I also love the Mac version as well. The battery life is amazing. The user interface is a joy to use. This will not replace my MacBook Pro systems, but it has earned a place as a new required device. The iPad is changing the way I access the Internet. That is why it is a revolutionary device to me. I live online almost all of my waking hours either because I am doing development or because I am simply on a network as a system administrator. I am telling you that this device is a revolutionary thing that is a brilliant example of excellent design. This is why I am sold on using Apple devices. They simply build the best devices. No one else even comes close IMO. I even used to joke about paying the “Apple tax” until I actually bought a MacBook Pro. Yeah, those days are long gone.

Apple isn’t perfect. My last post complained about Aperture 2 for example. No one is perfect, but Apple is the best when it comes to hardware and operating systems for my own personal use, hands down. I even agree with Steve Jobs about Flash, not because I like Apple, but because I just agree. When that whole thing came up, I immediately sided with the Adobe camp. Within about two hours I had reconsidered and walked away from the debate. I really don’t care. I love Objective-C and the frameworks (and development environment) on the Apple platform, it’s their device, and regardless of the widespread use of Flash, Flash sucks (let’s not even get into security). I love Photoshop though. That’s a winner. Adobe can prove the usefulness of Flash on the mobile platform before I will care. I am not missing Flash. ClickToFlash was a wonderful addition to Safari on my systems, and I am not missing Flash on my iPad. I think that just about says it all with respect to my stance on the whole issue.

Everyone else is free to execute a poor iPad imitation with Flash support, and it would even be better from a competitive standpoint to make a real competitor to drive innovation (though it doesn’t seem Apple needs help with innovation in this case). However, I feel it is not very likely there will be a real competitor anytime soon. I’ve not seen good competitors in any of Apple’s focus areas for hardware devices. The competition is not bad, but it’s not great either. I require great devices with elegant design. I want devices with incredible build quality that feel solid, look beautiful, and just work. I don’t care if they cost thousands more than junk. I don’t want junk.

Good job on the iPad, Apple. It is what I almost knew I’ve been looking for. Now that’s crystal clear.

Apple Aperture 2 Cop-out

Jan 1st, 2010 | Filed under Mac OS X

It seems Apple has a cop-out for the Aperture 2 problem I found (aside from slowness now):

Apple Aperture 2 Excuse

Wow. Are they serious? Note the date on that is October 14, 2009. Let me see if I get this straight. So, there’s a bug in your software (or OS, it used to work for me before just fine in Leopard), and the excuse is – “Oh, we never really supported grayscale!” Uh… then why does the UI seem to get it otherwise? Wait, was it that you just went that extra mile to make it work, but just not in a supported way?

Just in case that link ever goes away, let’s recap the excuse:

Aperture is designed to work with images from digital cameras which use an RGB color space. Non RGB images, such as grayscale TIFFs, grayscale PNGs, or RGB-A images (with alpha transparency) may not render correctly.

Are they serious? I sent in feedback and stated that I expected my money back.

Aperture 2 For the Lose

Dec 30th, 2009 | Filed under Mac OS X

I’ve run into my first Apple application failure. I really like Aperture 2. I’ve spent a fair amount of time moving my photographs into it. I didn’t even try Adobe Lightroom 2 because I liked Aperture 2’s UI so much after trying it. I rushed out and bought it, and life was good – life was good in Leopard. Well, it wasn’t that good actually. I discovered a memory leak with Nikon RAW images. I say discovered because, from what I recall now, various other people discovered it well before me and it still was not fixed. Go /var/log/system.log file! At least that was something I could work around.

When the Leopard of a snowy nature (as someone asked me about it once) came out, I had it installed the next day. Aperture 2.1.4 is supposed to work in Snow Leopard. It does, but it’s slow when going full screen. I put in a bug report about that. Then I realized that I could no longer make adjustments to large grayscale TIFF files! The image never fully loads (color files are fine, but again – way too slow). This was not a problem in Leopard. I put in a bug report about that and questioned the quality control here. It’s like Aperture 2 is some kind of play thing and not a flagship product. This makes me question other products, like Final Cut. What if I were to move to that someday and have the same issues? I realize that Snow Leopard is new, but Aperture 2 has been updated to run in it. We are at 10.6.2 here people. What’s the problem? You guys fix that memory leak yet?

All right, I’ve had enough. I downloaded Adobe Lightroom 2, and guess what? It’s awesome. It works with my large grayscale TIFF files. It is not slow. Adobe now has my money, when I buy it anyway – and I am not sure if Apple fixing this before my 30 day trial ends will make a difference. I just spent hours correcting some photograph TIFF scans I had previously fixed in Aperture 2.

Apple makes a great OS, and all of their other applications work fine. Aperture 2 sucks in my experience. It even ended up wasting a ton of my time already. What a huge disappointment.