moore's law and more

In April, I attended CCSC-NE 2007, where the Saturday keynote was given by Dr. Mary Jane Irwin, who is currently the Chair in Engineering in the Penn State Department of Computer Science and Engineering (among other things). Her talk, titled “Impacts of Moore’s Law: What Every CIS Undergraduate Should Know About the Impacts of Advancing Technology”, discussed three basic issues: the use and advancement of multi-core processors, the supply and conservation of power, and the “inevitable increase in hardware faults”. (The talk’s abstract is on the main conference page, along with access to the presentation slides.)

Moore’s Law is the conjecture that the number of transistors contained on a processor chip will double every two years. This has essentially been the case and has further been extrapolated, by Ray Kurzweil, to identify the “quadrupling of computational power” (see Kurzweil’s “The Law of Accelerating Returns”, which is an interesting read itself). As is probably quite obvious, and has been explicitly noted, the shrinking of transistors can only progress so far since atoms are only so big. (Kurzweil, in his article, identifies 2019 as the target date for reaching that limit.) Packing in more transistors also gives rise to power and heat problems. As Irwin points out, developing multi-core processors is one alternative way to make improvements.
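
To make the doubling arithmetic concrete, here is a small sketch that projects a transistor count forward under the two-year doubling assumption. The function name and the baseline are my own choices for illustration: I use the Intel 4004 from 1971, which had roughly 2,300 transistors, and the result is an extrapolation, not a measurement of any real chip.

```python
def projected_transistors(base_count, base_year, year, doubling_period_years=2.0):
    """Project a transistor count, assuming one doubling per period."""
    doublings = (year - base_year) / doubling_period_years
    return base_count * 2 ** doublings


# Illustrative baseline (an assumption for this example): Intel 4004, 1971,
# roughly 2,300 transistors. 1971 to 2007 is 36 years, i.e. 18 doublings.
estimate_2007 = projected_transistors(2_300, 1971, 2007)
print(f"{estimate_2007:,.0f}")  # 2,300 * 2**18 = 602,931,200
```

The point is less the specific number than how quickly eighteen doublings add up.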

Just over two years ago, desktop PCs became available with Intel’s first dual-core chips. Quad-core chips have since been made available for the general user, and research is still ongoing. Irwin references Intel’s prototype of an 80-core chip, announced in 2006, but it’s more proof-of-concept than anything practical (…or perhaps not – interesting content starts about 45 seconds in).

(Note: I’ve been referencing Intel technology in this post. This is not to slight AMD. In looking for relevant articles, it seems that Intel is currently ahead in the multi-core game. Intel released the general purpose quad-core processor around the first of the year whereas AMD’s quad-core is set to release in August 2007. From my point of view it’s a Ford-Chevy argument as to the merits of each and I don’t particularly have a preference for my general computing needs.)

There are several hardware considerations when moving to a multi-core system. Irwin points out a few, but such concerns seem to be closer to the hardware/firmware level. As I am more interested in software, my concerns are at the operating system level and up. So, how can applications benefit from multi-core systems?

First, if there are multiple cores, a separate application can execute on each core without having to be swapped off the processor to achieve multi-tasking. In this sense, the applications can run faster. The applications, though, will still have to vie for use of other resources. Plus, this doesn’t really improve the performance of any one application. If both cores are free, a multi-threaded application will be able to take advantage of both by executing two threads at the same time. The question remains, however: has the design of the application really taken advantage of the parallelism at hand?

Even if operating systems support multi-core processors directly, in the sense of providing APIs for developers to control how the application distributes itself, developing applications with generic parallelism is difficult at best. Further, if the 80-core processor is truly on the near horizon, applications developed for parallelism on dual- and quad-core processors may not scale well to chips with more cores. (I have a hard time believing that application developers are going to invest time worrying about scalability when time-to-market is of the essence. In fact, many applications probably won’t be modified. For example, will Microsoft Office really need to be parallelized?)

I anticipate that most headway, and it’s probably overly obvious, will be made in the gaming domain. The developments there will be interesting. I’m not all that familiar with game development, but with the advanced AI going into non-player characters, being able to run a bot as a unique thread on its own core seems to have significant implications.

Another family of applications that would benefit from parallelism is image and video editing – especially as the resolution of cameras keeps improving and file sizes increase with it. A simple divide-and-conquer technique would farm out segments of the media to different processors for transformation or, perhaps, apply the transformation to multiple media files at once. The application of complex filters to collections of media files, perhaps even a collection of frames in a video, could also be realized as a pipeline.
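
A minimal sketch of the divide-and-conquer idea, assuming an image is just a list of rows of 8-bit grayscale values. All of the names here (invert_rows, split_into_bands, transform_in_parallel) are hypothetical helpers I made up for illustration; a real editor would work on a proper image buffer.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def invert_rows(rows):
    """Example transformation: invert 8-bit grayscale pixel values."""
    return [[255 - pixel for pixel in row] for row in rows]


def split_into_bands(image, band_count):
    """Split an image (a list of rows) into at most band_count contiguous bands."""
    band_size = max(1, -(-len(image) // band_count))  # ceiling division
    return [image[i:i + band_size] for i in range(0, len(image), band_size)]


def transform_in_parallel(image, transform, workers=None,
                          executor_cls=ThreadPoolExecutor):
    """Apply transform to each band on its own worker, then reassemble."""
    workers = workers or os.cpu_count() or 1
    bands = split_into_bands(image, workers)
    with executor_cls(max_workers=workers) as pool:
        # map() yields results in input order, so bands rejoin top to bottom.
        results = list(pool.map(transform, bands))
    return [row for band in results for row in band]


image = [[0, 128], [255, 10], [1, 2]]
print(transform_in_parallel(image, invert_rows, workers=2))
```

One caveat: in standard CPython, threads do not run pure-Python, CPU-bound code on several cores at once, so the thread pool above only shows the structure. Passing concurrent.futures.ProcessPoolExecutor as executor_cls is the usual way to actually spread this kind of work across cores, at the cost of copying each band between processes.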

The last type of application that would benefit – though not in a way that directly benefits the average user – is the massively distributed computation, such as SETI@Home or GIMPS. Given the inherently distributed nature of these applications, the rate of number crunching for each would be accelerated. With each participant effectively having multiple machines, the applications would have to take advantage of the extra computing power. Either the application distributed to the end users would have to be parallelized, or the application would have to gather a larger chunk of the problem and distribute it across the cores.

In the case of SETI@Home, a chunk of the problem is distributed to the end user. When the application finishes its computation, it sends the results back to SETI and receives a new piece. Here, on the user’s end, the application could simply request as many chunks as there are cores. For GIMPS, the user currently installs the application and manually requests and reserves a random Mersenne number for analysis or requests a specific one. Since the search is not automated, the application could be parallelized or the user could run multiple instances of the application and work on several Mersenne numbers at once. Since the two approaches are probably roughly equal in terms of raw computation (modulo coordination of the parallelized effort), there is no strong reason to select one over the other. Perhaps this could be left for the user to decide – faster results on one candidate, or the option to have multiple candidate primes reserved.
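
The request-one-chunk-per-core idea might look something like the following. This is only a sketch: fetch_unit, analyze, and report are hypothetical callables I introduce for illustration and are not part of the actual SETI@Home or GIMPS client software.

```python
import itertools
import os
from concurrent.futures import ThreadPoolExecutor


def process_work_units(fetch_unit, analyze, report, core_count=None):
    """Fetch one work unit per core, analyze them concurrently, report each."""
    core_count = core_count or os.cpu_count() or 1
    units = [fetch_unit() for _ in range(core_count)]
    with ThreadPoolExecutor(max_workers=core_count) as pool:
        results = list(pool.map(analyze, units))  # results keep unit order
    for unit, result in zip(units, results):
        report(unit, result)
    return len(units)


# Stand-ins for the server round trips and the number crunching.
unit_ids = itertools.count()
reported = []
handled = process_work_units(
    fetch_unit=lambda: next(unit_ids),
    analyze=lambda unit: unit * unit,
    report=lambda unit, result: reported.append((unit, result)),
    core_count=3,
)
print(handled, reported)
```

Running several independent instances of the client, one per core, gets a similar effect without touching the application at all, which is the other option mentioned above.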

Distribution/parallelization is just the starting point in development. More complex problems arise when optimization is important. The previously mentioned application families would consider optimization vital. Optimization can occur at several levels. There is, of course, the algorithm itself but there are other considerations, three of which are memory management, core affinity, and priority management and multitasking.

Memory management has at least two facets. First, there is the issue of maximizing L1 cache hits, since the L1 cache is closest to the core (L2 and L3 caches are progressively slower, with RAM trailing the group). The other issue, specific to multi-core systems, is minimizing the data synchronization needed between threads executing on different cores.
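
As a rough illustration of the second point, one common way to cut down on synchronization is to let each thread accumulate into its own private variable and combine the results once at the end, rather than having every thread update one shared counter under a lock. The function below is a made-up example of that pattern; Python won’t show the cache-level effects, but the structure carries over to lower-level languages.

```python
import threading


def count_matches(chunks, predicate):
    """Count matching items, with one thread per chunk and no shared counter."""
    partial = [0] * len(chunks)  # one result slot per thread

    def worker(index, chunk):
        local_count = 0  # private to this thread, so no locking is needed
        for item in chunk:
            if predicate(item):
                local_count += 1
        partial[index] = local_count  # a single write per thread at the end

    threads = [
        threading.Thread(target=worker, args=(i, chunk))
        for i, chunk in enumerate(chunks)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sum(partial)  # merge once, after all threads have finished


print(count_matches([[1, 2, 3], [4, 5, 6], [7, 8]], lambda n: n % 2 == 0))
```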

Core affinity refers to trying to always run a given thread on the same core. Moving threads from core to core incurs an overhead penalty, as the thread state must be transitioned. By default, threads are typically run on the same core for this reason. However, as the scheduler tries to both balance load and permit threads to execute, it may opt to take advantage of available cores. There is also the question of whether or not it is possible – or desirable – to lock a thread down to a specific core while ensuring no other thread will run on it.
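
For a sense of what controlling affinity looks like from application code, here is a small sketch using Python’s os.sched_getaffinity and os.sched_setaffinity. These calls exist only on some platforms (Linux, notably), so the helper, whose name is my own, returns None where they are unavailable or refused.

```python
import os


def pin_to_one_core():
    """Restrict the caller to a single allowed core, where supported.

    Returns the new affinity set, or None if the platform lacks the calls
    or refuses the request.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    try:
        allowed = os.sched_getaffinity(0)  # 0 means "the caller"
        target = min(allowed)              # pick any core we may already use
        os.sched_setaffinity(0, {target})
        return os.sched_getaffinity(0)
    except OSError:
        return None


original = os.sched_getaffinity(0) if hasattr(os, "sched_getaffinity") else None
print(pin_to_one_core())
if original is not None:
    os.sched_setaffinity(0, original)  # undo the pinning afterwards
```

Note that this only limits where the caller may run. It does not reserve the core, so other threads can still be scheduled on it; exclusive use of a core is a separate matter of system configuration.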

When multiple threads are running, the thread with the highest priority preempts threads of lower priority. On the other hand, when two threads have the same priority, they either share the core, taking turns executing, or the core utilizes a run-to-completion-or-block scheme. The former is typically what is used for general purpose machines.
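
The two policies for equal-priority threads can be compared with a toy simulation of a single core. This is a simplified model of my own, not how any particular operating system implements scheduling: all tasks are assumed ready at time zero, nothing blocks, and a larger priority number means a more important task.

```python
from collections import deque


def simulate(tasks, quantum=1):
    """Return the order in which one core grants time units to tasks.

    tasks: list of (name, priority, work_units). Higher priorities run first;
    tasks of equal priority take turns, each running up to `quantum` units.
    """
    queues = {}
    for name, priority, work in tasks:
        queues.setdefault(priority, deque()).append((name, work))

    timeline = []
    for priority in sorted(queues, reverse=True):  # highest priority first
        queue = queues[priority]
        while queue:
            name, remaining = queue.popleft()
            ran = min(quantum, remaining)
            timeline.extend([name] * ran)
            if remaining > ran:
                queue.append((name, remaining - ran))  # back of the line
    return timeline


tasks = [("A", 1, 2), ("B", 2, 2), ("C", 2, 1)]
print(simulate(tasks, quantum=1))   # B and C share the core: B C B A A
print(simulate(tasks, quantum=10))  # run to completion:      B B C A A
```

With a quantum at least as large as the longest task, the round-robin scheme degenerates into run-to-completion, which is why the same function can show both behaviors.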

Granted, these issues are mostly concerns for applications with significant optimization needs and/or real-time systems. But advanced applications, such as those utilizing AI to perform real-time analysis of massive amounts of data, might be able to put them to use. For example, applications that analyze and predict stock market trends or analyze meteorological data to detect severe weather might benefit from such controls. And, if the 80-core processor is on the near horizon, the cost of handling such processing-intensive computations gets significantly cheaper.
