From: lynn@garlic.com (Anne & Lynn Wheeler) Newsgroups: alt.os.multics,alt.folklore.computers Date: 15 Jan 1995 23:48:11 GMT Subject: Re: old mainframes & text processing

The original method for CMS handling commands was/is sorta funky. Everything was tokenized into 8-character units concatenated together and then a system call was made. It turns out that this was the procedure for kernel system calls or commands ... or just about anything. The kernel would then attempt to resolve the first 8-character unit as an exec (aka shell script) in the search path, a binary executable in the search path, or a kernel function.
the standard system has a standard command abbreviation table ... but users could augment it with their own specifications.
there were some tricks regarding invoking kernel calls directly from the command line (or exec/shell-scripts) by typing appropriate binary data.
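as a rough illustration of the tokenizing and resolution order described above (a sketch only; the command names and lookup functions below are made-up stand-ins, not the actual cms internals):

    /* Minimal sketch (not actual CMS code) of the tokenized "plist" idea:
     * each word of the command line is folded to upper case, truncated or
     * padded to 8 characters, and the units are concatenated before the
     * system call.  The resolution order below mirrors the description
     * above: exec (script) in the search path, then binary executable,
     * then kernel function.
     */
    #include <stdio.h>
    #include <string.h>
    #include <ctype.h>

    #define TOKLEN 8
    #define MAXTOK 16

    /* build blank-padded 8-byte tokens from a command line */
    static int tokenize(const char *line, char plist[][TOKLEN + 1])
    {
        int n = 0;
        while (*line && n < MAXTOK) {
            while (*line == ' ') line++;
            if (!*line) break;
            memset(plist[n], ' ', TOKLEN);
            plist[n][TOKLEN] = '\0';
            for (int i = 0; *line && *line != ' '; line++)
                if (i < TOKLEN) plist[n][i++] = toupper((unsigned char)*line);
            n++;
        }
        return n;
    }

    /* hypothetical resolvers standing in for exec/module/kernel lookups */
    static int find_exec(const char *t)   { return strncmp(t, "PROFILE ", 8) == 0; }
    static int find_module(const char *t) { return strncmp(t, "COPYFILE", 8) == 0; }
    static int find_kernel(const char *t) { return strncmp(t, "RDBUF   ", 8) == 0; }

    int main(void)
    {
        char plist[MAXTOK][TOKLEN + 1];
        const char *cmd = "copyfile oldfile data a newfile data a";
        int n = tokenize(cmd, plist);

        printf("tokens:");
        for (int i = 0; i < n; i++) printf(" [%s]", plist[i]);
        printf("\n");

        if (find_exec(plist[0]))        printf("run as exec (shell script)\n");
        else if (find_module(plist[0])) printf("run as binary executable\n");
        else if (find_kernel(plist[0])) printf("dispatch to kernel function\n");
        else                            printf("unknown command\n");
        return 0;
    }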
early on cms ran into some scale-up issues and started doing sorts on file directories and leaving around a status bit indicating whether the directory was sorted or "dirty". If sorted ... simple filename search was performed with binary search (rather than linear). Also in the early '70s, (for performance) cms added a 2nd type of kernel call api ... which would only resolve to kernel functions (actually a kernel function branch table) ... instead of always doing the generalized search mechanism. lots of applications then got rebuilt using macros mapped to the new performance implementation (this somewhat reduced the requirement for the file directory sort).
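a minimal sketch of the sorted-directory trick (assumed data layout, not the cms control blocks): keep a "sorted" status bit, binary search when it is set, fall back to the linear scan when the directory is dirty:

    /* Sketch of the sorted-directory optimization described above: filenames
     * are fixed 8-character fields; a "sorted" flag says whether the
     * directory is clean.  If clean, lookup is a binary search; if dirty,
     * fall back to the original linear scan.
     */
    #include <stdio.h>
    #include <string.h>

    #define FNLEN 8

    struct directory {
        int sorted;                /* status bit: 1 = sorted, 0 = "dirty" */
        int count;
        char (*names)[FNLEN + 1];  /* 8-char blank-padded names           */
    };

    static int lookup(const struct directory *d, const char *name)
    {
        if (!d->sorted) {                        /* dirty: linear search   */
            for (int i = 0; i < d->count; i++)
                if (strncmp(d->names[i], name, FNLEN) == 0) return i;
            return -1;
        }
        int lo = 0, hi = d->count - 1;           /* sorted: binary search  */
        while (lo <= hi) {
            int mid = (lo + hi) / 2;
            int c = strncmp(name, d->names[mid], FNLEN);
            if (c == 0) return mid;
            if (c < 0) hi = mid - 1; else lo = mid + 1;
        }
        return -1;
    }

    int main(void)
    {
        char names[][FNLEN + 1] = { "ALPHA   ", "DELTA   ", "GAMMA   ", "OMEGA   " };
        struct directory d = { 1, 4, names };
        printf("GAMMA at slot %d\n", lookup(&d, "GAMMA   "));
        return 0;
    }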
The original exec(-1) (shell script) processor was somewhat more sophisticated than dos command processing ... but was augmented in the early '70s with "exec-2" and in the late '70s with rex (which turned into rexx in the early '80s).
--
Anne & Lynn Wheeler | lynn@garlic.com, lynn@netcom.com
From: lynn@garlic.com (Anne & Lynn Wheeler) Subject: pathlengths Newsgroups: comp.arch Date: 10 Mar 1995 05:54:57 GMT

in my youth i was fascinated with taking a 1000 instruction pathlength and turning it into zero ... or some such thing ... i.e. reorder several thousand lines of code so that functions that were high-use critical path became a side-effect of executing other things in a particular sequence. as a result i could typically page fault, page schedule, schedule, task switch, page i/o complete, task switch back, etc in 1/3rd to 1/20th the pathlength of any comparable system. Downside was that it could be a maintenance nightmare ... 5 or 10 years later, the scheduler might end up operating at less than optimal fair share because of perturbation in various side-effects due to code changes to random other places in the system.
370 virtual machine ... created a very clear-cut api for the microkernel. also extensive instrumentation was constantly exposing the cpu utilization of various parts of the system. between the clear-cut, non-ambiguous api ... and system function definition ... along with a (somewhat subliminal) perception that any cpu utilization by the "kernel" was bad ... there was constant striving to force the kernel path length to zero ... a subculture that I don't believe exists/existed in any other operating system (except possibly some real time controllers). In most other systems, kernel cpu utilization seems to be assumed to be part of the cost of running an operating system; it could somewhat be dismissed and attention refocused on other "more important" issues like gui interfaces.
another aspect of instrumentation was that I designed & executed a suite of over 2000 benchmarks to validate the dynamic adaptive feedback scheduler (for an extremely wide range of load & operational environments, supporting fair share, non-fair share, dynamic adaption to the resource bottleneck, etc); i.e. changing a process priority by a single digit (i.e. nice'ing) would (always?) result in a very predictable change in resource consumption across a wide range of configurations and loads. It took me 3 months elapsed time to run the benchmark validation suite ... in preparation for releasing what essentially was just a 6k instruction product feature.
From: lynn@garlic.com (Anne & Lynn Wheeler) Subject: pathlengths Newsgroups: comp.arch Date: 10 Mar 1995 17:18:30 GMT

actually i should qualify the 6000 instructions for the dynamic adaptive resource manager. At the time it was decided that they wanted me to release the resource manager, I had been doing a 5-way smp project ... so when I bundled the 6000 instructions, it actually consisted of:
1) lot of kernel restructure for smp
2) restructuring of kernel serialization to eliminate all cases of zombie processes and elimination of all known cases of kernel failures due to sequencing problems
3) bunch of fast path stuff
4) one of my page replacement algorithms from the '60s which was also smp'ed
5) dynamic adaptive resource management
... when they got around to releasing smp in the regular product, 85% of the resource manager instructions were absorbed into the base product ... leaving less than 1000 instructions in the resource manager feature.
... as another aside with regard to tss/360 ... I did a side project in the early '70s that analyzed 360/370 program code and was run off the assembler output. Around '84 (for nearly 10 years, the tss/370 project had been operating with a cast of 10s rather than thousands, and the vm/370 group had done the opposite), I did a structural comparison of the tss/370 kernel against the vm/370 kernel (using my application). By that time, the tss/370 kernel was achieving a succinctness and compactness that was more characteristic of the cp/67 kernel (and the vm/370 kernel had evolved into a much more complex organization); i.e. even with the 370 virtual machine API firmly implanted in everyone's minds ... strength of implementation focus became diffused as the organization grew.
... minor trivia question ... from a program analysis standpoint there was a major difference between the output of the standard 360/370 "H" assembler and the tss/370 assembler. In the "H" assembler, data-space addresses weren't tagged so some structural analysis became somewhat ambiguous. The tss/370 assembler provided a tagged identifier for each data-space identifier ... removing a lot of ambiguity from structural analysis.
--
Anne & Lynn Wheeler | lynn@garlic.com, lynn@netcom.com
| finger lynn@garlic.com for public key
From: lynn@garlic.com (Anne & Lynn Wheeler) Subject: Re: Why is there only VM/370? Date: 1995/04/05 Newsgroups: comp.arch

supporting 2nd order paging ... there actually are two forces at work here contributing to bad behavior ... the most obvious is that two level paging is redundant ... the less obvious is that running an LRU under an LRU violates the LRU assumption ... the 2nd level system begins to exhibit MRU behavior rather than LRU behavior (i.e. LRU assumes that the pages used the most recently are the ones most likely to be used in the future; a 2nd level LRU algorithm can actually start to exhibit the inverse behavior, where the least recently used page is the one most likely to be used next).
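one way to see the inversion: a first-level LRU that is paging steadily tends to cycle through its pages roughly in order of last use, and a second-level LRU underneath it then keeps evicting exactly the page about to be touched. a toy simulation under that (cyclic reference) assumption, comparing a 2nd-level LRU against a policy with no recency assumption:

    /* Not the VM/370 code -- just a toy illustration of the point above: a
     * guest that pages steadily over G frames generates a roughly cyclic
     * reference pattern, and a second-level LRU of H < G real frames then
     * misses on essentially every reference (it always just evicted the
     * frame about to be touched), while a policy with no recency assumption
     * (random) misses only about (G-H)/G of the time.
     */
    #include <stdio.h>
    #include <stdlib.h>

    #define G 100            /* guest frames (first level cycles over these) */
    #define H 80             /* host real frames backing them (second level) */
    #define REFS 100000

    static int run(int use_lru)
    {
        int frame[H], last_used[H], misses = 0;
        for (int i = 0; i < H; i++) { frame[i] = -1; last_used[i] = -1; }

        for (int t = 0; t < REFS; t++) {
            int ref = t % G;                  /* cyclic guest reference stream */
            int hit = -1, victim = 0;
            for (int i = 0; i < H; i++)
                if (frame[i] == ref) { hit = i; break; }
            if (hit >= 0) { last_used[hit] = t; continue; }
            misses++;
            if (use_lru) {                    /* evict least recently used     */
                for (int i = 1; i < H; i++)
                    if (last_used[i] < last_used[victim]) victim = i;
            } else {
                victim = rand() % H;          /* evict a random frame          */
            }
            frame[victim] = ref;
            last_used[victim] = t;
        }
        return misses;
    }

    int main(void)
    {
        printf("2nd-level LRU    misses: %d / %d\n", run(1), REFS);
        printf("2nd-level random misses: %d / %d\n", run(0), REFS);
        return 0;
    }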
for interactive, the group somewhat controlled both above the line (CMS) and below the line (CP) implementation and went a long way towards eliminating/optimizing their modular behavior.
Some effort was also expended on the VS1 operating system to eliminate duplicate operations when running 2nd level.
However the MVT->SVS->MVS genre had little such work done. A big hit came in the MVT->SVS transition. Effectively the VM kernel had to implement/shadow all the status bits that existed in the real hardware definition ... as well as simulate each priv. instruction executed by code running in the virtual machine. In the MVT->SVS transition, the virtual machine went from non-relocate to relocate ... which exploded the number of bits in the hardware definition by a couple orders of magnitude (before it was little more than the regs & the psw ... but now it also included all the virtual-virtual relocate tables).
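a rough sketch of why that explosion hurts (hypothetical structures, not the cp code): with a relocate guest, the hardware can only walk one set of tables, so the VM kernel has to compose the guest's virtual->guest-real tables with its own guest-real->host-real tables into a shadow copy, and redo the shadow whenever either side (or any shadowed status bit) changes:

    /* Hypothetical sketch of shadow-table composition: the guest's page
     * table maps guest-virtual -> guest-real; CP's own table maps
     * guest-real -> host-real.  CP must maintain a composed "shadow" table
     * and throw it away whenever either mapping changes -- the bookkeeping
     * that exploded with MVT->SVS as described above.
     */
    #include <stdio.h>

    #define PAGES 16
    #define INVALID -1

    static int guest_pt[PAGES];    /* guest virtual page -> guest real frame */
    static int host_pt[PAGES];     /* guest real frame   -> host real frame  */
    static int shadow_pt[PAGES];   /* guest virtual page -> host real frame  */

    static void rebuild_shadow(void)
    {
        for (int vpage = 0; vpage < PAGES; vpage++) {
            int greal = guest_pt[vpage];
            shadow_pt[vpage] =
                (greal == INVALID || host_pt[greal] == INVALID)
                    ? INVALID              /* fault must be reflected/handled */
                    : host_pt[greal];      /* composed translation            */
        }
    }

    int main(void)
    {
        for (int i = 0; i < PAGES; i++) { guest_pt[i] = INVALID; host_pt[i] = INVALID; }

        guest_pt[3] = 7;        /* guest maps its virtual page 3 to its "real" 7 */
        host_pt[7]  = 12;       /* CP has guest real frame 7 in host frame 12    */
        rebuild_shadow();
        printf("guest vpage 3 -> host frame %d\n", shadow_pt[3]);

        host_pt[7] = INVALID;   /* CP steals the frame: shadow must be redone    */
        rebuild_shadow();
        printf("guest vpage 3 -> host frame %d (fault)\n", shadow_pt[3]);
        return 0;
    }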
Another effect was that the number of priv. instructions executed by the (virtual) operating system exploded in the MVT->SVS transition. This could be seen in a non-VM environment in how long it took the same job to run under MVT vis-a-vis SVS (drastic increase in kernel pathlength). From a VM/370 standpoint that drastic increase in kernel pathlength also included a significant increase in the ratio of privilege instructions (requiring simulation) to total instructions.
In the SVS->MVS transition period ... some machines appeared that provided virtual machine hardware "assists" that would handle some percentage of privilege instruction "simulation" directly in the hardware (i.e. implement decode of the instruction according to the virtual machine rules, not real machine rules). This effectively reduced the simulation overhead to zero for those instructions. However, the performance gain from hardware assist was more than offset by further bloat in the MVS kernel and another significant increase in the ratio of (non-assisted) privilege instructions to total instructions executed.
In '68, an MFT job-stream running about 40% cpu busy ran 2.5 times longer executing under VM than it did standalone. By the summer of '68, I had reduced the cp pathlength so that the same job-stream only ran 1.15 times as long. Running the same job stream under MVT bumped the time elongation up to 1.5 times as long. Along came SVS and the elongation ballooned back up to over a 2* increase. When hardware assist appeared, backlevel MVT ran at almost 1.0 (almost no increase), while SVS tended to be in the 1.5 range. MVS blew it out of the water again ... although VS1 on some machines actually ran faster under VM than it would standalone.
In areas of generalized operating system function (scheduling, paging, dispatching, file i/o, etc), somewhat orthogonal to virtual machine simulation, CP tended to have 1/10th the pathlength of the generalized operating systems that might run under it (in addition to having better algorithms). As a result, the cp/cms timesharing combo (along with careful attention to minimizing duplication of effort above & below the simulation line) tended to blow away any of the other operating system offerings (i.e. TSO running on MVS running w/o VM on the native machine).
In many respects the attention to pathlength was a direct result of people being able to show performance running with & w/o VM. For some reason, the MVT->SVS->MVS bloat was significantly greater than any VM vis-a-vis non-VM difference ... but it never came under the same level of scrutiny.
From: lynn@garlic.com (Anne & Lynn Wheeler) Newsgroups: alt.folklore.computers Subject: Re: What is an IBM 137/148 ??? Date: 22 May 1995 21:02:09 GMT

155&165 had 80ns cache, relatively slow main memory (i think 2mic), and no virtual memory. there was also the 145, a non-cache machine that shipped with virtual memory hardware. 145 was vertical m'code, 155 was horizontal m'code and 165 was hard wired.
370 virtual memory was announced along with the 158 & 168, which had main memory that was around 4-5* faster. if i remember right, the 165->168 also decreased the 370 instruction implementation from something like an avg. of 2.1 to about 1.6 machine cycles per instruction. 155&158 were both about 0.9-1mip machines running "in-cache", but a cache miss was much more expensive on the 155 (because of slower memory). the 145 was about .3-.4mip. the 168 was about a 2.5mip machine.
the 155&165s had to have hardware upgrades to support virtual memory (in fact one or two features were left out of virtual memory announcements because of difficulty in retro-fitting the additional signal line to the 165 machines already in customer shops).
it was also possible to swing out the 155 front panel and flip a switch turning off the cache. for heavy i/o workload, non-cache 155, 145 & 360/67 all had about the same thruput running cp (i.e. a modified version of cp/67 running in 370 virtual memory mode).
the 138/148 were about 3 years later ('76?), had somewhat faster memory, but also lots more m'code space ... into which were coded "operating system performance assists". The 148 also had a lot of work done on floating point ... significantly reducing the difference between floating point & fixed point performance that had been characteristic of ibm machines up until then. the 148 had around 128k for m'code ... had around 6k left and they were looking for some way to utilize it. I posted the test run results to this group about a year ago which were used to select pieces of the kernel that got migrated to m'code (someplace in files at ftp.netcom.com/pub/ly/lynn).
vm/370 started out with a model number table that was used at boot time to adjust a number of parameters, theoretically based on processor performance. For the resource manager performance enhancement, I replaced it with some boot benchmarking code that attempted to figure all that stuff out dynamically.
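the flavor of the idea, as a hedged sketch (the work loop and the derived parameter below are stand-ins, not the actual resource-manager calculations): time a fixed amount of work at boot and scale the tuning parameters from the measured rate instead of from a model-number table:

    /* Toy sketch of boot-time calibration in place of a model-number table:
     * time a fixed, known amount of work and scale a scheduler parameter
     * from the measured rate.  Both the loop and the parameter are made up
     * for illustration.
     */
    #include <stdio.h>
    #include <time.h>

    static volatile unsigned long sink;   /* keep the loop from being optimized away */

    static double measure_loops_per_sec(unsigned long loops)
    {
        clock_t t0 = clock();
        unsigned long acc = 0;
        for (unsigned long i = 0; i < loops; i++)
            acc += i ^ (acc << 1);        /* fixed, known unit of work */
        sink = acc;
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
        return secs > 0 ? loops / secs : 0;
    }

    int main(void)
    {
        double rate = measure_loops_per_sec(50u * 1000 * 1000);
        /* hypothetical derived parameter: dispatch time-slice shrinks as the
           processor gets faster, within fixed bounds */
        double slice_ms = 1.0e9 / rate;   /* arbitrary scaling for illustration */
        if (slice_ms < 5.0)  slice_ms = 5.0;
        if (slice_ms > 200.) slice_ms = 200.;
        printf("measured %.0f loops/sec, using %.1f ms time-slice\n", rate, slice_ms);
        return 0;
    }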
there were the 360/95 & 360/195, which were the top-end floating point machines. the main work done on the 360/195 to create a 370/195 was adding the 370 non-virtual-memory instructions (same as in the original 155 & 165) and instruction retry.
A lot of work was put into all the machines in the 370 line to recover from non-reproducible, transient hardware errors (I vaguely remember that given the number of circuits in the 195, their MTBF, and the speed at which the 370/195 ran, a hardware error was expected to occur something like once a month; the addition of some of the 370 instruction retry for the 370/195 was supposed to mask that).
that wasn't really easy since it had a 64-instruction pipeline and imprecise interrupts. state configuration for what could be retried (& how) was difficult. imprecise interrupts contributed to the lack of virtual memory hardware on the 370/195.
there was also a prototype dual-istream 195 machine that was never produced. the pipeline didn't have speculative execution so a branch would drain the pipeline (except for the special case where the branch target would loop within the pipeline); some investigation was put into building a simulated SMP 195 that appeared to be two processors ... but with only a small additional hardware bump for a 2nd instruction address and a 2nd set of regs. For many operating environments it would provide nearly twice the thruput for <5% increase in hardware.
late '70s saw the introduction of the 303x line. The main feature of the 303x line was the channel controller box. The 3031 was effectively a 158 with new covers and a channel controller, the 3032 was a 168 in new covers & a channel controller. The 3033 was a new machine. The 168 effectively used technology that was about 4 circuits/chip. The 3033 was built with newer 20 circuits/chip ... although it was originally laid out just using the 168 logic (i.e. only 4 circuits/chip being used) ... which resulted in about a 20% performance improvement (because of the somewhat faster chip). Some last minute redesign of the logic in critical places up'ed the improvement to closer to 50% (utilizing more than 4 circuits/chip and getting more intra-chip processing).
The 4331/4341 were introduced about the same time to replace the 138/148.
Some trivia information ... the 303x channel controller box was basically a 158 with different horizontal m'code. It also had some additional I/O feature support. About that time, I had done a custom modified operating system for the disk engineering lab. Their typical environment was 6-10 "test cells" connected to a machine. Problem was that operating a single test-cell engineering box had a habit of "crashing" the mainframe operating system, typically within 10-15 minutes. As a result, testing had to be done on a stand-alone, single test-cell at a time basis, using pretty rudimentary software. The trick was to implement an absolutely bullet-proof replacement I/O subsystem that would support all test-cells operating concurrently. Turns out there were a couple tricks you could play with the 303x channel controller, sending standard ops in a special sequence that would help fence off a test-cell that had gone berserk. Also if you hit all channels on a channel controller with a clear channel in quick succession, you could cause it to reboot. Didn't help with 4341 operation/testing ... just had to grin & bear it as best as possible.
The 155/158 & 165/168 were done by different teams of engineers at different locations. The next machine after the 303x was the 3081 done by the 155/158 engineering group ... followed by the 3090 done by the 165/168 engineering group.
The 3081 was introduced as an SMP-only machine (2-way, and the 3084 4-way). However, it wasn't a "multi-processor" in traditional IBM terms. Up until then, the 360, 155/165, 158/168, and 303x SMPs had all been independent machines with independent power-supplies, channels, etc ... but with hardware interconnect to share the memory bus and synchronize caches ... and they could be partitioned and operated as independent machines. The 3081 had two processors packaged in the same box sharing a lot of the same components.
At the low-end of the original 370 line were the 115/125 which had a novel architecture (for 370). The machines were laid out with a common bus which could be shared by up to 9 microprocessors. In the 115 all the microprocessors were the same. Different microprocessors typically were dedicated to different functions and had different m'code loads. The primary difference between the 115 & the 125 was that for the 125 there was a unique faster microprocessor running the 370 m'code load. As far as I know all configurations shipped to customers were limited to one processor running with a 370 m'code load. I did a cp design that would support a 125 being configured with up to five 370 processors ... also utilized programmability of the other processors to offload a lot of kernel pathlength. When that project got killed, I rolled the design over into standard 2-way smp leaving out a number of special features (since i wasn't going to be allowed to modify the m'code). I did a custom release 3 version for HONE (started out with eight 2-way SMPs operating in single-system image mode with a common disk farm, i.e. this config had stuff so that if any complex was taken offline &/or failed, online users were reconfig'ed onto the remaining available machines). The same SMP design shipped in the release 4 product.
The 5-way 125 and 148 activities were going on about the same time and for some reason the 148 group viewed the 125 activity as competitive. As a result in the shoot-out meetings I got to be both the shooter and the target on both sides of the table.
--
Anne & Lynn Wheeler | lynn@garlic.com, lynn@netcom.com
| finger lynn@garlic.com for public key
From: Lynn Wheeler <lynn@garlic.com> Subject: Re: SMP, Sequent Computer Systems, and software Date: 05/25/1995 Newsgroups: comp.arch

.. for cache based machines ... various forms of affinity could also boost thruput ... interrupts occurring on processors in the middle of doing random other things ... vis-a-vis interrupts going to a processor which already had been processing interrupts ... handing off higher level (transaction) processing to a machine already doing higher level processing has been able to show a 50% increase in MIP rate (i.e. better control over when asynchronous interrupts might occur as well as what processor was already doing what). The structure of fine-granularity locks, inter-CPU processing hand-off, and dynamic adaption has an impact on being able to achieve such a goal.
From: lynn@garlic.com (Anne & Lynn Wheeler) Newsgroups: comp.lang.asm370,alt.folklore.computers Subject: Re: 1401 overlap instructions Date: 03 Jun 1995 20:36:48 GMT

my first student programming job was to implement the 1401 mpio function (front-end to the 709) on a 360/30 ... this was summer of '66. the 1401 had 3 (7-track) tape drives, a 2540 reader/punch & a 1403N1 printer.
incoming cards were either bcd (subset of ebcdic) or binary (used all 80 columns and 12 rows; read column binary with the top & bottom 6 rows going into different bytes, 160 bytes total). had to do the read & feed/select-stacker as separate operations. i couldn't keep an output tape, input tape, card reader, punch/printer all running with the card reader at full speed while using os/360 (release 6?). wrote my own multi-tasker & interrupt handlers; took over the interrupts from the operating system ... and then could run the card reader at full speed ... while also processing the output stream. Used as much of memory as possible for elastic buffers.
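the "elastic buffers" amount to a producer/consumer ring between the card-reader side and the tape/printer side; a small sketch of the idea (interrupt handlers modelled as plain functions, buffer count made up):

    /* Sketch of the elastic-buffer idea: the card-reader side (producer)
     * and the tape/printer side (consumer) each run off their own
     * device-end interrupts and meet in a ring of card images, so the
     * reader never waits on the slower output devices until the ring fills.
     */
    #include <stdio.h>
    #include <string.h>

    #define CARD 80
    #define NBUF 64                       /* "as much memory as possible"    */

    static char ring[NBUF][CARD];
    static int head, tail, count;

    /* called from the reader device-end interrupt: queue the card just read */
    static int reader_done(const char *card)
    {
        if (count == NBUF) return 0;      /* ring full: must hold the reader */
        memcpy(ring[tail], card, CARD);
        tail = (tail + 1) % NBUF;
        count++;
        return 1;
    }

    /* called when the tape/printer is ready for the next record             */
    static int output_ready(char *card)
    {
        if (count == 0) return 0;         /* nothing buffered yet            */
        memcpy(card, ring[head], CARD);
        head = (head + 1) % NBUF;
        count--;
        return 1;
    }

    int main(void)
    {
        char card[CARD];
        memset(card, ' ', CARD);
        memcpy(card, "HELLO", 5);
        reader_done(card);                /* reader interrupt delivers a card */
        if (output_ready(card))           /* output side drains it later      */
            printf("wrote card: %.8s\n", card);
        return 0;
    }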
my assembly program ran to about 2000 cards (about a box). w/o any macros ... it took about 30 minutes to assemble the program and generate an executable. did a version with macros and a switch so that it ran stand-alone or ran under the os. problem was that macros really slowed down the assembler; a DCB macro took six minutes elapsed time; five DCBs for the two tapes and three unit record devices added another 30 minutes to the assembler elapsed time.
After a while, I found it was frequently faster to repunch/multipunch (026) a 12-2-9 TXT card with patches than it was to re-assemble ... somewhat arcane skill being able to read punch holes in 360 binary decks, fan a binary deck ... pick out the card with the address of the instruction(s) needing patching and dup/repunch new card with fixes.
felt somewhat silly when i finally discovered .REP cards.
--
Anne & Lynn Wheeler | lynn@garlic.com, lynn@netcom.com
| finger lynn@garlic.com for public key
From: Lynn Wheeler <lynn@garlic.com> Newsgroups: comp.os.linux.development.system,comp.arch,alt.folklore.computers Subject: Re: Who started RISC? (was: 64 bit Linux?) Date: 13 Jun 1995 14:35:55 GMT

had to have been prior to '78; i was at a conference in '76 presenting a 16-way smp design and the 801 group was presenting 801 and the cpr operating system.
i remember because somebody from the 801 group flamed our presentation about being able to modify existing kernel to support 16-way smp because "they had looked at the existing source and the control blocks didn't contain any fields that would support smp". I guess they never heard of modifying control blocks.
in any case, i returned the favor by flaming their hardware deficiencies. their response was that they were writing a brand new closed operating system and that the hardware deficiencies were specifically selected as being performance/software trade-offs. The closed system implementation would do authentication and authorization checking at compile/load time ... and as a result runtime application code would have inline "supervisor" code (w/o kernel calls) that would compensate for the hardware deficiencies. 801 processors would never be supported by any but the closed operating system with application inline kernel code (i.e. the trade-off significantly changes if authentication/authorization has to be done at runtime with protected kernel calls).
i was at a pitch yesterday given by one of the sun java people ... and load time authentication & authorization design-point description was similar (although the setting is totally different).
Newsgroups: alt.folklore.computers,alt.os.multics From: lynn@netcom3.netcom.com (Lynn Wheeler) Subject: Re: Who built the Internet? (was: Linux/AXP.. Reliable?) Date: Thu, 15 Jun 1995 13:55:07 GMT

... then there was the internal network which originated out of another part of 545 tech sq., circa 1970. the one byte field reminded me of a similar story from 1976. about that time jes2 was thinking about inventing nje ... but sticking with the original hasp 1-byte addressing scheme (originally used to identify "virtual" unit record devices ... but the left over numbers were going to be used to identify network nodes; a typical jes2 system might have 60-80 virtual unit record definitions ... leaving something less than 200 for network identifiers). problem was that at that time the internal network was already between 500 & 700 mainframe nodes. The jes2 group basically replied ... oh well, no real customer would ever have such a large network. From then on the jes2 group was playing constant catchup; a couple years later when jes2 did a hack to expand support to 1999 nodes, the internal network was well over 2000 mainframes.
As a result jes2/nje systems could only play problematic end-point attachment roles on the internal net but could never be a reliable intermediate node, being unable to address the complete network (i.e. the internal network was purely host based).
In fact the only reason jes2 could attach at all was a brilliant piece of layering in the native/backbone of the original internal network implementation. nje made the mistake of mixing the line driver protocol, the network protocol, the transport protocol, and the application protocol all in the same header record. two jes2 systems at different release levels frequently couldn't even talk to each other directly (to say nothing of attaching to the internal network).
In order for jes2 to attach to the internal network, a series of nje line driver emulators was written for the native networking code to translate between the native internal network and nje. The series of nje line driver emulators typically corresponded to different flavors/releases of nje &/or jes2. The nje emulation would encapsulate and pass real nje headers originating from jes2 systems ... for non-jes2 nodes, the drivers could fabricate and/or strip nje headers as appropriate.
There are a whole series of stories about jes2/nje systems crashing other jes2/nje systems on the internal network. A typical scenario: a jes2/nje system at a specific release level in hursley attempts to transmit something to a jes2/nje system in san jose via the internal net. The intermediate backbone node in hursley has the appropriate jes2/nje line driver started, accepts the incoming transmission, appropriately encapsulates the nje header and starts forwarding it. It eventually arrives at the backbone intermediate node in san jose, which recognizes the node and passes the initial record to the appropriate nje driver emulator that de-encapsulates the nje header before forwarding it over the link to the destination jes2/nje node. Since the intermediate node does the appropriate nje handshaking, the header record is accepted and passed to jes2 processing directly. Since (at least) four layers of protocol were all jumbled together in the nje header ... minor field definition changes could really confuse the jes2 subsystem processing. Specific types of jes2 confusion would lead to panics ... which could then cascade into bringing down the whole mainframe system (an early form of unintentional virus).
In typical fashion,
1) the initial fix was to require the backbone nje driver emulators to do at least field verification (for the appropriate release level) and where possible to do inter-release nje field conversion ... before forwarding to a jes2 system
2) the eventual "fix" was to force the native internal network software to abandon all drivers but officially sanctioned nje emulation drivers.
.... and now back to your regularly scheduled program
--
Anne & Lynn Wheeler | lynn@netcom.com
From: lynn@garlic.com (Anne & Lynn Wheeler) Newsgroups: alt.folklore.computers Subject: Re: 3330 Disk Drives Date: 18 Jun 1995 17:10:56 GMT

following from a report i originally did in '81 ... part of the numbers were excerpted in postings in this group early last year (archived at
Some of the numbers were done in the late '70s where it was shown that upgrading from 3330-ii to 3350 only improved performance if the allocated data was limited to approx. the same as on the 3330-ii. The 3380 numbers are for the original (single density) 3380.
                         2305   2314   3310   3330   3350   3370   3380
======================================================================
data cap, mb             11.2   29     64     200    317    285    630
avg. arm acc, ms         0      60     27     30     25     20     16
avg. rot del. ms         5      12.5   9.6    8.4    8.4    10.1   8.3
data rate mb             1.5    .3     1      .8     1.2    1.8    3
4k blk acc, ms           7.67   85.8   40.6   43.4   36.7   32.3   25.6
4k acc. per sec          130    11.6   24.6   23     27     31     39
40k acc per sec          31.6   4.9    13.    11.3   15.    19.1   26.6
4k acc per sec per meg   11.6   .4     .38    .11    .08    .11    .06
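the derived rows in the table follow directly from avg. arm seek + avg. rotational delay + transfer time; a small sketch that reproduces them from the raw device parameters above:

    /* Reproduces the derived rows of the table above from the raw device
     * parameters: block access time = avg seek + avg rotational delay +
     * block size / data rate; accesses/sec = 1000/ms; accesses/sec/MB
     * divides by the drive capacity.
     */
    #include <stdio.h>

    struct drive { const char *name; double cap_mb, seek_ms, rot_ms, rate_mb_s; };

    static const struct drive drives[] = {
        { "2305", 11.2, 0,  5,    1.5 },
        { "2314", 29,   60, 12.5, 0.3 },
        { "3310", 64,   27, 9.6,  1.0 },
        { "3330", 200,  30, 8.4,  0.8 },
        { "3350", 317,  25, 8.4,  1.2 },
        { "3370", 285,  20, 10.1, 1.8 },
        { "3380", 630,  16, 8.3,  3.0 },
    };

    static double access_ms(const struct drive *d, double kbytes)
    {
        return d->seek_ms + d->rot_ms + kbytes / d->rate_mb_s; /* kb/(mb/s) = ms */
    }

    int main(void)
    {
        printf("drive  4k acc ms  4k/sec  40k/sec  4k/sec/MB\n");
        for (unsigned i = 0; i < sizeof drives / sizeof drives[0]; i++) {
            const struct drive *d = &drives[i];
            double ms4 = access_ms(d, 4), ms40 = access_ms(d, 40);
            printf("%-5s  %9.2f  %6.1f  %7.1f  %9.2f\n",
                   d->name, ms4, 1000 / ms4, 1000 / ms40, (1000 / ms4) / d->cap_mb);
        }
        return 0;
    }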
slightly different table ... assuming a uniform access distribution, loading the indicated max. data on the drive (i.e. not filling the whole thing) gives the resulting 4kbyte-block-access/sec/mbyte (i.e. 3380 with only 40mbyte loaded gives approx. the same performance as 2314).
                    2305   2314   3310   3330   3350   3370   3380
data cap, mb        11.2   29     64     200    317    285    630
4k acc. per sec     130    11.6   24.6   23     27     31     39
20 meg              -      .041   .091   .082   .098   .122   .152
40 meg              -      -      .023   .021   .025   .031   .039
60 meg              -      -      -      .009   .011   .014   .017
80 meg              -      -      -      .005   .006   .008   .010
--
Anne & Lynn Wheeler | lynn@garlic.com, lynn@netcom.com
| finger lynn@garlic.com for public key
From: Lynn Wheeler <lynn@garlic.com> Date: Mon, 17 Jul 1995 22:34:32 -0700 Newsgroups: 2020world Subject: 2nd wave?

Something somewhat spurious from when I was trying to define the term business science a couple years back ....
[ascii-art pyramid diagram: "Wisdom & Understanding" at the apex; below it planning/discovery and knowledge workers (groupware); then information workers & fact workers (process systems, CAD/VLSI, dollars, people, VLDB, materials); factory workers (CIM) at the base; the base axis runs from Complexity through Distributed Information to Real Dollars]
--
Anne & Lynn Wheeler | lynn@garlic.com, lynn@netcom.com
| finger lynn@garlic.com for public key
From: Lynn Wheeler <lynn@garlic.com> Subject: Re: atomic load/store, esp. multi-CPU Date: 08/19/1995 Newsgroups: comp.unix.aix

cs (compare&swap) is a software simulation of the ibm 370 instruction originated in the early 70s (nearly 25 years ago at the ibm cambridge scientific center) to be both enabled-UP & SMP "safe". however, the 6000 software simulation is only "enabled uniprocessor" safe (basically 6000 cs is a system call that simulates the c&s semantics in disabled kernel code) and does NOT provide SMP "safe" semantics (the software simulation does NOT provide hardware cache synchronization semantics across a multi-CPU environment).
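for reference, the semantics being simulated are roughly the following (a sketch, not the 370 microcode or the aix code); the point above is that running this in disabled/non-preemptible code only makes it atomic against other work on the same processor, not against another cpu:

    /* Sketch of the compare&swap semantics being simulated.  On a
     * uniprocessor, running this with interrupts disabled (or in a
     * non-preemptible system call) is enough to make it atomic; on an SMP
     * it is not, because another processor can slip between the compare and
     * the swap unless the hardware provides an interlocked update plus
     * cache coherency.
     */
    #include <stdio.h>

    /* returns 1 and stores new_val if *addr still equals *old_val, else
     * returns 0 and gives the caller the current value to retry with       */
    static int compare_and_swap(int *addr, int *old_val, int new_val)
    {
        /* <-- on real SMP hardware everything from here ... */
        if (*addr == *old_val) {
            *addr = new_val;
            return 1;
        }
        *old_val = *addr;
        return 0;
        /* ... to here must be a single interlocked operation */
    }

    int main(void)
    {
        int counter = 5, seen = 5;
        while (!compare_and_swap(&counter, &seen, seen + 1))
            ;                             /* typical retry loop for an update */
        printf("counter = %d\n", counter);
        return 0;
    }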
From: lynn@garlic.com (Anne & Lynn Wheeler) Subject: Re: Cache and Memory Bandwidth (was Re: A Series Compilers) Date: 1995/07/08 Newsgroups: comp.arch,comp.sys.super,comp.unix.cray

there are also the custom rs/6000 4-way 801 machines that didn't do hardware coherency (effectively since day #1, no cache coherency has been a fundamental 801 feature). memory segments can be labeled write-shared or non-write-shared. a segment labeled write-shared doesn't get cached ... non-write-shared segments require software coherency ... analogous to support for i-cache & d-cache coherency or for processor & I/O coherency (i.e. lots of flushing to memory).
--
Anne & Lynn Wheeler | lynn@garlic.com, lynn@netcom.com
| finger lynn@garlic.com for public key
From: lynn@garlic.com (Anne & Lynn Wheeler) Subject: Re: Virtual Memory (A return to the past?) Date: 1995/09/27 Newsgroups: comp.arch

.. some of you are probably getting tired of seeing this ... but a typical '68 hardware configuration and a typical configuration 15 years later
the 360/65 is nominally rated at something over .5mips (reg<->reg slightly under 1mic, reg<->storage starts around 1.5mic and goes up). running relocate increases the 67 memory bus cycle 20% from 750ns to 900ns (with a similar decrease in mip rate). the 67 was a non-cached machine and a high I/O rate resulted in heavy memory bus (single-ported) contention with the processor.
drums are ibm'ese for fixed head disks.
disk access is avg. seek time plus avg. rotational delay.
the 3.1l software is actually circa late '70 or early '71 (late in the hardware life but allowing more mature software). the 3081k software is the vm/hpo direct descendant of the cp/67 system.
90th percentile trivial response for the 67 system was well under a second; the 90th percentile trivial response for the 3081k was .11 seconds (well under the instantaneous-observable threshold for the majority of people).
the page i/o numbers are sustained averages under heavy load. actual paging activity at the micro-level shows very bursty behavior, with processes generating page-faults at device service intervals during startup and then slowing down to contention rates during normal running. the 3081k system had pre/block page-in support (i.e. more akin to swap-in/swap-out of a list of pages rather than having to individually page fault).
the big change between '68 and '83 ... which continues today ... is that the processor has gotten much faster than disk tech. has gotten faster. real memory sizes and program sizes have also gotten much bigger than disk has gotten faster (programs have gotten 10-20 times larger, disks get twice as fast, so sequentially page faulting a memmap'ed region 4k bytes at a time takes 5-10 times longer). Also, while current PCs are significantly more powerful than the mainframes of the late '60s and the individual disks are 5-10 times faster, the aggregate I/O thruput of today's PCs tends to be less than the aggregate I/O thruput of those mainframe systems.
In any case, when I started showing this trend in the late '70s, that disk relative system performance was declining (i.e. its rate of getting better was less than the getting-better rate for the rest of the system), nobody believed it. A simple measure was that if everything had kept pace, the 3081K system would have been supporting 2000-3000 users instead of 320.
Somewhat bottom line is that even fixed head disks haven't kept up with the relative system performance. Strategy today is whenever possible do data transfers in much bigger chunks than 4k bytes, attempt to come up with asynchronous programming models (analogous to weak memory consistency & out-of-order execution for processor models), and minimize as much as possible individual 4k byte at a time, synchronous page fault activity.
--
Anne & Lynn Wheeler | lynn@garlic.com, lynn@netcom.com
| finger lynn@garlic.com for public key
From: Lynn Wheeler <lynn@garlic.com> Subject: 801 & power/pc Date: 1995/09/17 Newsgroups: comp.arch

there is a fundamental design change between 801/power and power/pc. I relate it back to a conference i was at in the '76 time-frame (almost 20 years ago) ... where the 801 group flamed us for saying we were modifying an existing operating system to support 16-way SMP. I returned the courtesy during their presentation.
the base 801/power had various hardware optimization trade-offs that were based on the hardware being used only in a closed, custom proprietary operating system (and would NEVER be used in any sort of open/general purpose system). two were the small number of concurrent shared memory objects (reduced hardware) and optimized non-coherent cpu/cache operation (a fundamental principle of 801 was that it would NEVER be used in a cache coherent &/or SMP configuration). Another fundamental principle was that security/authentication/etc would be done at compile/load time ... and all applications could directly execute all instructions (w/o kernel calls for runtime security/authentication) ... and therefore the compiler could generate inline code to compensate for the explicitly designed hardware shortcomings (in some sense, various design challenges were similar to the current hot java).
the translation to power/pc violates a number of fundamental, basic 801 principles established during the '70s/'80s. For instance, 801 never has to worry about serialization on any byte of data; once the CPU has data fetched from the cache, by definition it never has to worry about the local cpu copy and the cache/memory copy being in sync. not having to worry about synchronization allows for streamlined cpu pipelined operations that would be much more difficult in a cache-coherent environment.
From: Lynn Wheeler <lynn@garlic.com> Subject: Re: Crashproof Architecture Date: 09/17/1995 Newsgroups: comp.arch

formalizing existing structures for saving virtual memory across system boots is relatively straight-forward. Much more complicated is the potentially huge amount of system state, especially associated with active i/o operations, pending signals, etc. there were a couple places in the early '70s that took existing commercial virtual memory operating systems and implemented such function ... and the virtual memory save hack was possibly only 5-10% of the work.
at least one such place (early '70s) had a couple geographically distributed centers with large clusters. checkpoint allows any complex in a cluster to be periodically shutdown for maintenance w/o impacting non-stop applications. application could resume later on the same complex after it came back up ... but more frequently resumed immediately on another complex in the same cluster/center (and given certain types of restrictions regarding i/o activity could even resume on a complex located in a different cluster/center).
From: Lynn Wheeler <lynn@garlic.com> Subject: slot chaining Date: 1995/09/23 Newsgroups: comp.arch

i added slot chaining and ordered seek queueing to cp/67 around 1970 ... and typically could operate a 1/3mip to 1/2mip processor with 70-80 mixed mode users (batch & interactive) running the processor at 100% utilization and <1sec 90th percentile response for trivial requests. this was with 2-3 2301s (fixed head, 4mbyte each) which would hit an aggregate sustained thruput of 270 4kbyte page transfers per second on a shared i/o bus that operated at 1.5mbytes/sec. This was with less than 1/3rd of the total virtual memory pages resident on the fixed head devices ... the rest were typically on movable head disks. The 2301s operated at 3600rpm and had 9 4k pages formatted per two tracks (i.e. two revolutions to transfer 9 4k pages).
In the conversion to vm/370, the 2301s were upgraded to 2305s (same transfer rate but 12mbyte/device). I also introduced a smarter page replacement algorithm (compared to the 1-hand & 2-hand clock I had created in the late '60s) as well as a more sophisticated page migration algorithm (i.e. relocation of virtual memory pages between fixed head devices and movable head devices). Configurations which split 2-6 2305s across at least two I/O buses (1.5mbyte/sec each) could hit 600 page transfers/sec over the two busses (channels for those mainframe folks). Total aggregate page transfers would exceed that because typical 2305 configurations only had capacity for 20% or less of the total virtual memory pages (the rest on moveable head devices).
device latency = 0 is the wrong way to describe it; the hardware was capable of ordering queued requests so that there was no device redrive/service delay between one request and the next ... and the programming for the 2305 turned out to be trivial. The 2305 supported multiple virtual hardware addresses. The simplest software support mapped each rotational position (or page slot) to a unique hardware address on the device. Effectively a two-level address ... the 1st part selected the device (actually a range of hardware addresses) and the rotational position of the page selected the hardware sub-address ... i.e. from the sector number it was possible to calculate a predictable rotational position ... and each page start position in a revolution was assigned to a specific hardware address on the device. The hardest part was selecting the convention ... once that was done it effectively only added 4-5 instructions to the mainline disk support (although all the migration stuff was much more sophisticated).
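a sketch of that slot-to-subaddress convention (the slot count and address range below are made up for illustration, not the real 2305 geometry):

    /* Sketch of the 2305-style addressing convention described above: a
     * page's slot number within a revolution selects one of the device's
     * exposure (sub-)addresses, so requests for different rotational
     * positions queue on different addresses and the device can pick them
     * up in rotational order with no extra revolutions.
     */
    #include <stdio.h>

    #define SLOTS_PER_REV 8        /* assumed page slots in one revolution   */
    #define BASE_ADDR 0x240        /* assumed base device address (range)    */

    struct devaddr { int unit_addr; int slot; };

    static struct devaddr page_to_devaddr(int page_no)
    {
        struct devaddr a;
        a.slot = page_no % SLOTS_PER_REV;     /* predictable rotational position */
        a.unit_addr = BASE_ADDR + a.slot;     /* one exposure address per slot   */
        return a;
    }

    int main(void)
    {
        for (int page = 0; page < 4; page++) {
            struct devaddr a = page_to_devaddr(page);
            printf("page %d -> slot %d, queued on device address %03X\n",
                   page, a.slot, a.unit_addr);
        }
        return 0;
    }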
From: lynn@garlic.com (Anne & Lynn Wheeler) Date: 1995/09/25 Subject: SSA Newsgroups: comp.arch.storage

ssa, grump ....
a large number of the 9333 systems were for ha/cmp and we heavily backed the project. however we were also doing cluster scale-up using fcs. during san fran usenix, jan. 1992, Hester, my wife, and I had a meeting with Ellison, Baker, Shaw, Puri (and a couple others) in Ellison's conference room. We proposed having 16-way fcs pilot clusters in customer shops with parallel oracle by summer of 1992 ... with 128-way available by ye92.
unfortunately the kingston group were out trolling for technology and found cluster scale-up the very next week. in something like 8-10 weeks, the project was transferred to kingston, announced as a supercomputer, and we were instructed to not work on anything involving more than 4 processors.
in the elephant's dance to do the supercomputer subset of cluster scale-up ... the device interconnect strategy got obliterated. so instead of 9333->interoperable family (1/4 & full speed fcs, potentially 1/8 & 1/4 speed fcs on serial copper, etc); in the resulting confusion, 9333->ssa.
while ssa is quite good technology (especially compared to scsi), an interoperable fcs family strategy would be better. also for ha/cmp, it was important to be able to do fencing; at least for (some) hippi switches I was able to get a fencing feature included. i haven't followed fcs much recently and don't know if any of the current fcs switches are able to fence.
trivia question: in the late '88 timeframe, what was the projected '92 per drop price for fcs (aka fibre channel standard; including prorated price of switch ... able to use for desktop operation)?
various posts mentioning our ha/cmp product:
https://www.garlic.com/~lynn/subtopic.html#hacmp
various posts mentioning original sql/relational implementation
https://www.garlic.com/~lynn/submain.html#systemr
+-+-+-+
Anne & Lynn Wheeler | lynn@garlic.com, lynn@netcom.com
| finger lynn@garlic.com for public key
From: Lynn Wheeler <lynn@garlic.com> Subject: characters Date: 1995/09/25 Newsgroups: comp.lang.misc,alt.folklore.computers,comp.sys.misc

.. 'green' cards list the character mapping for ebcdic code (8bit, 256). even for printable characters there are some rough edges for ebcdic<->ascii translators .. ebcdic doesn't have brackets, braces, etc.
i added tty/ascii support to one of the mainframe operating systems while an undergraduate back in the 60s ... chose a somewhat arbitrary mapping for ascii->ebcdic for incoming.
not sure about telco uses. i did a custom modified operating system that was installed at at&t longlines about '75 (effectively the resource manager that shipped in '76 but with lots of additional features, like a page mapped file system, etc .. that didn't ship). I remember getting a call around '83 or so from some IBM'er asking if i could do something since at&t still had a lot of machines running it (at&t had been moving it to next generation machines as they came out ... the opsys-base predated SMP support ... and the next generation hardware was smp only).
did do another project in the 60s as an undergraduate which was a terminal controller (someplace there is an article blaming us for originating the ibm oem control unit business). In any case, discovered an interesting feature of the standard ibm control unit; the line-controller (uart) reads the leading bit into the low-order bit position, i.e. if you ever looked at 'raw' ascii in the memory of an ibm mainframe (after it came in off an ascii device) ... you would find it bit-reversed.
I haven't paid any attention recently to see whether the new generation of tcp controllers still supports bit-reversal ... the terminal controllers i believe still have the convention. It would lead to some confusion on the mainframe side if the terminal side still did ascii bit reversal and the ip controllers didn't (i.e. would need two completely different sets of translate tables when translation was done on the mainframe side).
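the bit reversal itself is mechanical; a sketch of the kind of fix-up (or of how a reversed-value translate table would be built) if the controller delivers each byte with the leading bit in the low-order position:

    /* Sketch of the fix-up implied above: if the line controller shifts the
     * leading bit of each character into the low-order bit position, the
     * raw bytes in mainframe storage are bit-reversed ascii, and either the
     * translate table has to be built over reversed values or the bytes
     * have to be reversed before a normal ascii<->ebcdic translation.
     */
    #include <stdio.h>

    static unsigned char reverse_bits(unsigned char b)
    {
        unsigned char r = 0;
        for (int i = 0; i < 8; i++)
            if (b & (1u << i)) r |= 1u << (7 - i);
        return r;
    }

    int main(void)
    {
        unsigned char wire = reverse_bits('A');   /* what storage would contain */
        printf("'A' = 0x%02X, bit-reversed = 0x%02X, recovered = '%c'\n",
               'A', wire, reverse_bits(wire));
        return 0;
    }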
From: Lynn Wheeler <lynn@garlic.com> Subject: multilevel store Date: 1995/10/04 Newsgroups: comp.arch

ibm mainframe (and others) expanded store is somewhat along those lines ... except it pushed the architecture slightly further. borrowed some from electronic disk and some from multi-level software controlled cache. implementations are at least 10 years old.
there is a very wide fast custom bus. at issue is that it is big and somewhat remote (in terms of nanoseconds), so latency for access is long by memory bus standards. however, with the wide/fast bus ... once the transfer starts, it completes on the order of <100 standard instructions. the interface paradigm is a very long synchronous (cache-bypass, storage-to-storage) instruction. the trade-off is that a normal async. device driver pathlength runs to several thousand instructions. the synchronous paradigm busies the cpu for a very long period of time (<100 instructions; actually i haven't seen the current timings, 10 years ago it was about 20 instruction timings) ... but that is only a small fraction of what a device driver paradigm would cost.