Newsgroups: comp.unix.large,comp.arch.storage
From: lynn@netcom4.netcom.com (Lynn Wheeler)
Subject: Re: Big I/O or Kicking the Mainframe out the Door
Date: Wed, 5 Jan 1994 16:31:59 GMT

typical unix has two issues:
synchronous I/O & buffer copies
standard system strategies for read-ahead and write-behind somewhat compensate for synchronous I/O (although it contributes to a level of unpredictability associated with various failure modes).
memory mapped strategies aren't a complete panacea. Sequential access with mmap'ed files (i.e. page-faults) can actually be worse than freads unless kernel also has a mmap'ed "read-ahead" strategy (similar to buffered I/O read-ahead).
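A minimal sketch of the two sequential access styles being compared, assuming a modern Unix-like system and Python 3.8 or later. The `madvise`/`MADV_SEQUENTIAL` hint is a present-day way of asking the kernel for mapped read-ahead and is not something the post refers to; the function names and the sample file are hypothetical. The sketch only shows that both paths read the same data, it does not measure performance.

```python
import mmap
import os
import tempfile

def checksum_with_read(path, chunk=64 * 1024):
    """Sequential buffered reads; the kernel's normal read-ahead applies."""
    total = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk)
            if not block:
                break
            total += sum(block)
    return total

def checksum_with_mmap(path):
    """Sequential access through a mapping; data arrives via page faults."""
    total = 0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # Optional hint that access will be sequential, so the kernel
            # may read ahead. Not available on every platform, hence the guard.
            if hasattr(m, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
                m.madvise(mmap.MADV_SEQUENTIAL)
            for off in range(0, len(m), mmap.PAGESIZE):
                total += sum(m[off:off + mmap.PAGESIZE])
    return total

# Build a small sample file and check that both paths see the same bytes.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(256)) * 1024)   # 256 KiB of sample data
    sample_path = tmp.name
try:
    same = checksum_with_read(sample_path) == checksum_with_mmap(sample_path)
finally:
    os.unlink(sample_path)
```

Without the hint (or an equivalent kernel heuristic), each new page in the mapped loop can cost a separate fault, which is the case the paragraph above warns about.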
memory-mapped I/O can actually come in several flavors. In the early '70s I did a variation on memory-mapped I/O integrated in with a mainframe filesystem (which was never shipped as a product, but I maintained at various installations over the next 10-15 years). Repeat-Note: the numbers/examples posted previously were for systems w/o the mmap'ed filesystem.
advantages of the mmap'ed filesystem modification (even for mainframe system that already supported I/O schema supporting asynchronous I/O AND direct I/O ... noncopy ... transfer):
1) eliminated certain duplicated kernel function activity (in file I/O) ... net effect was a cut of 60-90% in kernel file I/O pathlength
2) mmap'ed interface allowed specification of mapping an arbitrary number of disk blocks to an arbitrary span of virtual pages ... along with advisory flags indicating Synchronous, Asynchronous, Deferred
• Deferred - effectively equivalent to performing mmap function w/o any data transfer (left for the application to page fault the data)
• Synchronous (on reads) - data will be transferred before returning to the application
• Asynchronous (on reads) - schedule data transfer but return immediately to application
the synchronization of asynchronous activity relied on controlling whether the application had access to the virtual page. It was even possible to take a multi-buffered, asynchronous I/O application and simply have the file I/O subsystem replace the standard I/O kernel calls with mmap kernel calls (assuming buffers are page-aligned). Thus it was possible to emulate an existing file I/O, direct transfer, multi-buffer, asynchronous paradigm (with the interface) as well as the more common full-file mmap'ing.
3) large mainframe (multi-block) I/Os were scheduled in virtual address order. The page I/O subsystem had the capability for out-of-order transfer which reduced the latency ... effectively similar to out-of-order transfer supported by some of the sophisticated caching controllers (i.e. controller can begin transfer at current head position ... rather than waiting for some specific "starting" point).
4) direct mapped (mainframe) I/O reads tended to leave the virtual pages dirty and direct mapped I/O writes did nothing for the page dirty status. Running the I/O thru the paging system resulted in not having the dirty bit on (for reads) and cleaned the dirty bit (on writes). The result was a much lower ratio of dirty pages (which would have to be written if selected for replacement).
In the previous posting giving 1970/1983 comparison, the '83 system had extensive (but pretty non-intrusive) instrumentation of I/O activities. All I/O requests (including page I/O) were time-stamped to measure both queueing time and service time. The queueing & service times were accumulated (by category) for individual processes, individual disks, and total overall system. Other instrumentation made it possible to calculate (with resolution of several microseconds) for each process: total elapsed active time, total cpu service time, total cpu queueing time, total page I/O queueing time, total page I/O service time, total file I/O queueing time, total file I/O service time, and total "blocked" time (waiting for some service). For the '83 3081 system example, the file I/O queueing+service times ran significantly higher than the page I/O queueing+service times.
As an aside, a measure of contention (i.e. scheduling to the bottleneck) is not the measure of the use of a resource, but the time spent queued/waiting for the resource to be available (i.e. contention is queueing time ... not service time or queue+service time).
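The distinction between queueing time and service time can be sketched as follows. This is only an illustration of the bookkeeping, not the '83 instrumentation itself; the record layout, the category labels, and the sample timestamps are all made up.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class IORequest:
    category: str       # e.g. "page" or "file" (illustrative labels)
    queued_at: float    # time the request was placed on the device queue
    started_at: float   # time the device began servicing it
    finished_at: float  # time the transfer completed

def accumulate(requests):
    """Return {category: [total queueing time, total service time]}."""
    totals = defaultdict(lambda: [0.0, 0.0])
    for r in requests:
        totals[r.category][0] += r.started_at - r.queued_at    # contention
        totals[r.category][1] += r.finished_at - r.started_at  # resource use
    return dict(totals)

# Hypothetical sample: both categories use the device for the same time,
# but file requests spend far longer waiting to get at it.
sample = [
    IORequest("file", 0.0, 6.0, 8.0),
    IORequest("file", 1.0, 8.0, 10.0),
    IORequest("page", 0.0, 0.5, 2.5),
    IORequest("page", 3.0, 3.0, 5.0),
]
totals = accumulate(sample)
```

In this sample the two categories have identical service totals, so a utilization figure would not tell them apart; only the queueing totals show where the contention is.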
In any case, for a moderately heavy file I/O workload (circa '83), the mmap'ed changes could reduce the measured file I/O queueing time by a factor of three (as well as some other measurable performance improvements).
As a separate aside, I briefly saw a posting referencing caching strategies and "Mattson". I didn't catch the context. I've run across hardcopy of '78 kernel code that was installed on several large mainframe systems to capture the live-load I/O activity data ... & was the real-time feed into Mattson's caching model. I'm not sure whether the posting was referencing the paper written on those results ... or was looking for some other reference.
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
From lynn@netcom4.netcom.com Tue Oct 19 07:45:23 1999
Newsgroups: comp.unix.large,comp.arch.storage
From: lynn@netcom4.netcom.com (Lynn Wheeler)
Subject: Re: Big I/O or Kicking the Mainframe out the Door
Date: Wed, 5 Jan 1994 16:34:26 GMT

One of the potential pitfalls of simple mmap implementations for full I/O mapping is the "cleaning" of pages once they have been used. In a buffer'ed paradigm, the application explicitly declares that it is done with some data by requesting it to be overlaid. This has the net effect of "reducing" the application's working-set size (as opposed to a straight file mmap'ing and waiting for the system to discover the pages are no longer in use).
I had done a fair bit with various variations of clock-like global LRU replacement algorithms (and working set controls) during the late '60s. During the early '80s there was detailed modeling work comparing them with both "true/exact" LRU replacement and "optimal" replacement. The global clock replacement algorithms all tended to operate within <15% of the performance of true/exact LRU-replacement ... including getting worse when LRU got worse.
About that time, I stumbled on an interesting twist to clock global LRU (I was originally trying to find a way of improving cache performance and reducing inter-processor contention in a SMP environment).
There are times when LRU-replacement appears to be "chasing its tail", i.e. it is choosing for replacement the exact virtual page that would be needed next. In such scenarios, it would be very desirable to switch to a near MRU (most recently used) replacement algorithm.
The net result of the algorithm variation was that it tended to operate as a LRU-replacement algorithm in the part of the envelope where LRU did well ... but would effectively switch to a random replacement algorithm in parts of the envelope where LRU was doing poorly (i.e. random was significantly better than LRU).
It was interesting that the implementation had no code to explicitly recognize the condition and change the behavior ... it was somewhat the fallout of the way it treated the distribution of reference & non-reference page patterns. The implementation code is very close to standard clock global LRU replacement, with equivalent pathlength, but does have better cache hit profile as well as better SMP operational characteristics.
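The "chasing its tail" case is easy to reproduce in a toy simulation. The sketch below is not the clock variation described here (its details are not given in the post); it only compares exact LRU with random replacement on a cyclic reference pattern that is one page larger than real storage. The function names, the 10-frame storage size, and the fixed seed are assumptions for illustration.

```python
import random
from collections import OrderedDict

def lru_faults(refs, frames):
    """Count page faults under exact LRU replacement."""
    resident = OrderedDict()              # least recently used entry first
    faults = 0
    for page in refs:
        if page in resident:
            resident.move_to_end(page)    # now the most recently used
        else:
            faults += 1
            if len(resident) >= frames:
                resident.popitem(last=False)   # evict least recently used
            resident[page] = True
    return faults

def random_faults(refs, frames, seed=0):
    """Count page faults when the victim frame is picked at random."""
    rng = random.Random(seed)
    resident = []
    faults = 0
    for page in refs:
        if page not in resident:
            faults += 1
            if len(resident) >= frames:
                resident[rng.randrange(frames)] = page
            else:
                resident.append(page)
    return faults

FRAMES = 10
# Loop repeatedly over 11 pages: one more than fits in real storage.
refs = [p for _ in range(200) for p in range(FRAMES + 1)]
lru = lru_faults(refs, FRAMES)
rnd = random_faults(refs, FRAMES)
```

With this pattern LRU always evicts exactly the page that is needed next, so every reference faults, while random replacement faults on only a fraction of them.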
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
From lynn@netcom4.netcom.com Tue Oct 19 07:45:24 1999
Newsgroups: comp.arch
From: lynn@netcom4.netcom.com (Lynn Wheeler)
Subject: Re: Register to Memory Swap
Date: Mon, 17 Jan 1994 00:24:25 GMT

Test&set has been around in smp since at least the early/mid-60s (i.e. test a location for zero/non-zero at the same time setting it to non-zero).
Atomic update (based on the existing value) is (I'm reasonably sure) result of CAS's work in 70/71 time-frame at 545 Tech Sq. (which led to the original mnemonic compare&swap ... i.e. his initials). The use of atomic compare&swap for multi-threaded, "enabled" applications (whether SMP or non-SMP) dates from the same period.
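A minimal sketch of the difference between the two primitives. Python has no direct access to the hardware instructions, so a lock stands in for the atomicity the hardware provides; the `Word` class and `atomic_add` helper are hypothetical names used only for this illustration.

```python
import threading

class Word:
    """One memory word; the internal lock stands in for hardware atomicity."""
    def __init__(self, value=0):
        self.value = value
        self._atomic = threading.Lock()

    def test_and_set(self):
        """Return the old value and leave the word non-zero, in one step."""
        with self._atomic:
            old = self.value
            self.value = 1
            return old

    def compare_and_swap(self, expected, new):
        """Store new only if the word still holds expected; report success."""
        with self._atomic:
            if self.value == expected:
                self.value = new
                return True
            return False

def atomic_add(word, delta):
    """Update based on the existing value; retry if another thread got in first."""
    while True:
        old = word.value
        if word.compare_and_swap(old, old + delta):
            return old + delta

counter = Word(0)

def worker():
    for _ in range(1000):
        atomic_add(counter, 1)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Test&set can only build a lock (a non-zero result means somebody else holds it), whereas the compare&swap retry loop lets multi-threaded code update a shared value directly without holding a lock across the update.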
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: alt.folklore.computers
From: lynn@netcom9.netcom.com (Lynn Wheeler)
Subject: Re: Multitasking question
Date: Mon, 7 Mar 1994 06:59:22 GMT

in '68/'69 i completely replaced cp/67 implementation with dynamic adaptive feedback stuff. The implementation i replaced was for all intents and purposes the same as described in the 4.3bsd book. my understanding from the cp/67 developers was that several of them had come over from ctss/7094. It is possible/plausible that both cp/67 and unix can trace a common ancestry back to ctss/7094.
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: alt.folklore.computers
From: lynn@netcom11.netcom.com (Lynn Wheeler)
Subject: Re: Multitasking question
Date: Wed, 9 Mar 1994 23:50:32 GMT

The initial version (1.0, 1967) of the cp/67 scheduler (that I saw sometime in 68) was like your CTSS description, I believe 10 queues, top queue having very short "time-slice" and each lower queue having progressively larger time-slice. If process went to t/s-end, it was moved to the tail of the next lower queue. If process became blocked (during execution) it would move to the next higher queue.
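The queue transitions just described can be sketched as a small multilevel feedback structure. This is an illustration of the rules only, not the cp/67 code: the class and method names are hypothetical, and the doubling of the time-slice per level is an assumption (the post says only that lower queues had progressively larger slices).

```python
from collections import deque

NUM_QUEUES = 10
BASE_SLICE = 1   # hypothetical time units for the top queue

class FeedbackQueues:
    def __init__(self):
        # Index 0 is the top queue (shortest time-slice).
        self.queues = [deque() for _ in range(NUM_QUEUES)]

    def time_slice(self, level):
        # Assumption: the slice doubles at each lower queue.
        return BASE_SLICE << level

    def add(self, name, level=0):
        self.queues[level].append(name)

    def pick(self):
        """Return (name, level) from the highest non-empty queue, or None."""
        for level, queue in enumerate(self.queues):
            if queue:
                return queue.popleft(), level
        return None

    def slice_expired(self, name, level):
        """Ran to time-slice end: tail of the next lower queue."""
        self.add(name, min(level + 1, NUM_QUEUES - 1))

    def blocked_then_ready(self, name, level):
        """Blocked during execution: move to the next higher queue."""
        self.add(name, max(level - 1, 0))

mlfq = FeedbackQueues()
mlfq.add("compute")
mlfq.add("editor")
name, level = mlfq.pick()                 # "compute" from the top queue
mlfq.slice_expired(name, level)           # used its whole slice, drops a level
name, level = mlfq.pick()                 # "editor" from the top queue
mlfq.blocked_then_ready(name, level)      # blocked early, stays at the top
```

Note that `pick` has to scan the queues on every dispatch; this sketch makes no attempt to model the overhead of the original implementation.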
Two biggest problems with it were lack of any page-thrashing controls and high cpu overhead in the implementation (including non-linear increase with respect to increasing numbers).
Somebody out at Lincoln Labs (summer '68?) did a 2-level queue replacement that had a table (table values set proportional to machine real storage size) that limited the number of "active" processes in each queue. "queue-slices" were on the order of a second and "blocking" didn't result in queue transition. There was a daemon that once a second recalculated queue position based on cpu use & aging (this bore more similarity to bsd description).
Problem with this implementation was that the page-thrashing control didn't take into account process/program behavior ... and the cpu overhead for dispatching/scheduling function was still non-linear (although significantly reduced from ctss look-alike).
The Dynamic-adaptive changes (late '68, '69):
1) eliminated cycling through processes & therefore any non-linear & "scaling" problems
2) cpu process overhead per dispatch/schedule (proportional to work done & not number of users) was reduced to near zero
3) dynamic adaptive page-thrashing controls based on program behavior, real-storage availability, AND efficiency of paging subsystem (also originated clock global LRU replacement, existing algorithm was close to FIFO). Also redid most of the paging code so the pathlength was as close to zero as possible.
4) execution order was based on an "advisory-deadline" calculation. the actual calculations took into account a variety of factors, individual program behavior (like cpu use, paging behavior, etc) as well as overall system behavior (cpu bottlenecked, paging bottlenecked), interactive, etc. Future advisory deadlines were proportional to program characteristics and granularity of cpu allocation. "Interactive" programs were allocated small cpu granularities frequently (frequency interval was proportional to cpu granularity), however being interactive (or non-interactive) didn't affect aggregate cpu allocation (just granularity with frequency proportional to granularity size).
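The granularity/frequency property in item 4 can be illustrated with a toy deadline scheduler. This is a sketch under strong simplifying assumptions, not the actual calculation (which took many more factors into account): every process is always runnable, each has a fixed share and a fixed cpu granularity, and the next deadline is simply pushed out by granularity divided by share. All names and numbers are hypothetical.

```python
import heapq

def run(processes, total_time):
    """processes: {name: (share, granularity)}; returns cpu time used per name."""
    used = {name: 0.0 for name in processes}
    # Heap of (advisory deadline, name); everybody starts out due at time 0.
    heap = [(0.0, name) for name in sorted(processes)]
    heapq.heapify(heap)
    clock = 0.0
    while clock < total_time:
        deadline, name = heapq.heappop(heap)    # earliest deadline runs next
        share, granularity = processes[name]
        clock += granularity                    # runs for one granule of cpu
        used[name] += granularity
        # Next deadline interval is proportional to the granule just consumed,
        # so small granules come around more often but add up to the same share.
        heapq.heappush(heap, (deadline + granularity / share, name))
    return used

usage = run({"interactive": (0.5, 1.0), "batch": (0.5, 10.0)}, 1000.0)
```

Both processes have the same share; the "interactive" one gets it as frequent small granules and the "batch" one as occasional large granules, and the aggregate cpu they receive comes out the same to within one granule.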
In general:
1) near zero pathlengths (& elimination of non-linear scaling)
2) dynamic adaptive ("scheduling to the bottleneck")
3) consistent resource control regardless of factors like interactive/non-interactive ... such factors could affect granularity of allocation but not magnitude of allocation
Grenoble Science Center published a paper in CACM sometime in the early '70s describing their implementation of a "working-set" dispatcher on CP/67 (effectively a faithful implementation of what is described in Denning's article). They had a 1mbyte 360/67 (which left 154 4k "pageable-pages" after fixed kernel requirements). They provided about the same level of performance for 30 users as we did for 70 users on a 768k 360/67 (104 4k pageable pages).
The differences (circa '70-71):
              grenoble       cambridge
machine       360/67         360/67
# users       30-35          70-75
real store    1mbyte         768k
p'pages       154 4k         104 4k
replacement   local LRU      "clock" global LRU
thrashing     working-set    dynamic adaptive
priority      cpu aging      dynamic adaptive
There were also big differences in "straightline" pathlength as well as non-linear scaling pathlength effects.
Some of the pathlength stuff I came to regret. I would rearrange a couple hundred lines of code in half-dozen modules so that the sequence of events came out implicitly like I wanted them to ... rather than to have to explicitly write code to make it happen (zero pathlength implementation). Some of the code made it into the product and possibly years later got modified (and things stopped working the way they should ... it is hard to explain about "implicit" workings).
Note that the '67 had a 900 nanosecond cycle time and no cache. Most instructions took from slightly less than 2 machine cycles to over 3-4. Compute bound with no I/O, machine might clock .5-.7 MIPS. However, heavy I/O could result in some severe memory bus contention (with the instruction unit) cutting MIP rate significantly.
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: alt.folklore.computers
From: lynn@netcom9.netcom.com (Lynn Wheeler)
Subject: Re: Schedulers
Date: Fri, 11 Mar 1994 10:36:17 GMT

CP/67 & VM/370 Scheduling
... long ago, and far away ... hopefully this doesn't sound too much like a biography.
As an undergraduate in '68 & '69 I did a lot of cp/67 modifications, high-performance fastpath, dynamic adaptive dispatching, dynamic adaptive scheduling & thrashing controls, page replacement algorithm (clock global LRU), teletype support, and numerous other things. Also was one of four people responsible for designing/building the first non-IBM control unit for IBM mainframe. Also did some MFT/MVT hacking, ripped out the 2780 support from HASP-III and replaced it with 2741 & TTY support along with an editor that implemented the CMS/CTSS syntax (early '69). The university I was at also had an IBM 2250 (high performance graphic display) and I hooked the backend of the CMS editor to 2250 character library to create a full-screen editor (late '68).
In '70, I graduated and joined IBM/CSC at tech sq. I was able to get some number of my changes incorporated into various cp/67 product releases.
With the advent of the IBM 370, the "product" group split off, did a somewhat ground-up rewrite for vm/370 and moved out to a bldg. in Burlington Mall.
When CSC finally got a 370 in 73, I ported a lot of my changes from cp/67 to vm/370. I had at the time two BU students that were on work/study program. Much of the scheduling/thrashing descriptions I've already posted earlier in this thread. On what was called VM/370 "release 2 plc 15" we had done a whole lot of changes including the following:
• automated operator procedures (done to support some sophisticated benchmarking methodology)
• dynamic shared library support (changes to both cp and cms)
• high performance page-mapped file system (both cp & cms)
• dynamic adaptive feedback dispatching/scheduling (including thrashing controls)
• page replacement (clock global LRU & something better than LRU)
• page "migration" (moving pages around between high performance fixed-head disk and movable arm disks)
• various disk & page I/O subsystem optimization
• more "fastpath" optimization
The three of us ran an "internal IBM" product group supporting the modifications on the CSC mainframe and also packaging and shipping the modified product to internal IBM sites (system was operational on some 100 mainframes at internal IBM sites). Through a special agreement it was also shipped/supported for some AT&T sites. There were approximately 30,000 lines of new &/or changed code.
The benchmarking methodology included a lot of synthetic workload stuff. There was extensive monitoring and profiling of production systems, synthetic workloads were created to simulate production characteristics and then validated/tuned. Along with this was heavy instrumentation of the kernel (some of which was also necessary for being able to perform dynamic adaptive calculations).
Typical benchmarking process would build a specific kernel, boot the kernel, initialize parameters as specified, run a specific synthetic workload, gather all the data, kill all synthetic workload processes and then go on to the next benchmark. We would typically (automatically) kick this off at midnight on Friday and it would run totally automated until 8am Monday morning when it would rebuild the "production" kernel and bring up the machine in "production" mode for normal users. We could sometimes get 100 separate synthetic workload benchmarks run over the 56hr weekend.
A subset of the automated operator support and the dynamic shared library support was picked up by the vm/370 development group for the "basic" release 3 product.
About the time that the base "release 3" was shipped a decision was made to make an "add-on" software product release of the CSC "performance enhancements" (I believe the SHARE scheduler white paper referred to them as the "Wheeler Scheduler"). Unfortunately at the time, both of the BU students had gone on to other things and I was doing a scalable SMP project.
In any case, I went to work part time on turning out this "Resource Manager PRPQ". Besides the technical work (see following) this was going to be the first IBM SCP (system control program) software that was charged for. As a result I spent possibly more time on various "business" stuff than on the technical things (helping formulate how IBM was going to charge for SCP software).
A set of benchmarks was established for validating the design/implementation, with systematic variations in the synthetic workload, in the hardware configuration, in the kernel parameters and characteristics. The benchmarks were designed to cover all conceivable operational environments that the software might be used for. There were also "clunkers" ... if a nominal heavy load for a particular configuration was 100 users ... several tests at 800 synthetic users were run. In order to perform such stress tests, numerous timing-dependent bugs in the base system had to be found and fixed ... as well as a redesign of the kernel synchronization mechanism to eliminate all possible cases of zombie processes.
On the order of 2000 (new) separate benchmark tests were run in the process of (re)validating the RM-PRPQ. Included in the tests were priority changes (nice'ing) of numerous kinds. The default dynamic adaptive mode was to assume "fair share" resource allocation .... however administrative priority changes (nice'ing) were defined to have very specific effects as to process resource allocation. This had to be verified across a wide combination of possible configurations and workloads (as well as demonstrating that each nice'ing increment exactly resulted in the defined resource allocation change ... administrative controls allowed specification of either more or less resources than fair share ... or specific percentage of total system resources). The instrumentation/monitoring of a large percentage of the "internal IBM" sites running the code was also used to help calibrate/validate the dynamic adaptability of the code.
The RM-PRPQ did not contain the page-mapped filesystem changes, but included several of the performance enhancements. The RM-PRPQ had a couple of new modules and 6500 lines of code (which included modifications to 60-some existing CP kernel modules).
The (IBM) CP kernel module naming convention was a three letter prefix ("DMK" for all modules in the CP kernel) followed by a three letter module designation. The RM-PRPQ module responsible for most of the dynamic adaptive resource logic was named DMKSTP. I believe that the number of mainframes licensed to run the RM-PRPQ went over 1000 (mid-70s). I also got the job of being 1st, 2nd, & 3rd level field support for the product for the first 9 months after it shipped.
Come 1983, I had almost forgotten the R2PLC15 system that we had been supporting at AT&T back in '74. The IBM branch office called and wanted help in getting AT&T off the system. Apparently as each new processor came out AT&T would migrate the software to the new generation of machines. What was interesting/gratifying was that while the dynamic adaptive implementation was dynamically adaptive ... there was well over an order of magnitude difference (>20* the performance) between the 1974 machines (that R2PLC15 had been calibrated on) and the 1983 machines.
In any case, the cp/67 and the vm/370 (at least after the RM-PRPQ) thrashing controls essentially became the same.
See Melinda's paper for more/other details.
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: alt.folklore.computers
From: lynn@netcom9.netcom.com (Lynn Wheeler)
Subject: Re: Schedulers
Date: Fri, 11 Mar 1994 10:45:08 GMT

thrashing footnote ... actually for the stress tests the thrashing controls worked too well. To really get some of the stress tests working I had to build special kernels with the thrashing code crippled. With 5* to 10* the nominal number of users expected to be found in a heavily loaded system, AND the thrashing controls crippled ... could get some real stress going in various parts of the system (like 5-15 seconds elapsed time to service a page fault ... when the system was paging at 300/sec).
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: alt.folklore.computers
From: lynn@netcom11.netcom.com (Lynn Wheeler)
Subject: Re: Schedulers
Organization: NETCOM On-line services
Date: Fri, 11 Mar 1994 20:34:32 GMT

more stories out of school?
Basically global clock runs around all pages in real storage resetting reference & testing reference bits. When it finds a page that doesn't have its reference bit set on ... the page is selected for replacement.
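A minimal sketch of the global clock sweep just described, assuming a fixed set of real frames and software flags standing in for the hardware reference bits. The class and attribute names are hypothetical, and details such as changed-page handling, shared pages, and the later variations are left out.

```python
class ClockReplacer:
    def __init__(self, num_frames):
        self.page = [None] * num_frames          # page held by each real frame
        self.referenced = [False] * num_frames   # stand-in for reference bits
        self.hand = 0                            # current position of the sweep
        self.where = {}                          # page -> frame holding it

    def touch(self, page):
        """Reference a page; return True if the reference caused a page fault."""
        frame = self.where.get(page)
        if frame is not None:
            self.referenced[frame] = True        # hardware would set this bit
            return False
        frame = self._select_frame()
        old = self.page[frame]
        if old is not None:
            del self.where[old]                  # the replaced page is gone
        self.page[frame] = page
        self.where[page] = frame
        self.referenced[frame] = True
        return True

    def _select_frame(self):
        """Sweep all frames regardless of owner: reset set bits, take the first clear one."""
        n = len(self.page)
        while True:
            frame = self.hand
            self.hand = (self.hand + 1) % n
            if self.page[frame] is None or not self.referenced[frame]:
                return frame
            self.referenced[frame] = False       # reset and keep sweeping

clock = ClockReplacer(3)
faults = [clock.touch(p) for p in [1, 2, 3, 4, 3, 5]]
```

In the sample trace page 3 is referenced again after its bit has been reset, so the sweep passes over it and takes page 2 instead. How quickly the hand comes back around to a given frame depends on how often pages are being demanded, which is the implicit "reset interval" behaviour.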
A "local" LRU algorithm is limited on a process by process basis for running around virtual address space resetting & testing reference bits.
The original CP/67 "algorithm" just cycled thru real storage looking for a page that didn't belong to an active process (no resetting &/or testing of the page's hardware reference bits). If it didn't find one, it would pick the first available page. Assuming all real storage was "occupied" this algorithm effectively approximated FIFO. It also had pathlength performance penalties since all real storage had to be scanned prior to deciding it was just doing FIFO.
The original VM/370 algorithm had a threaded list of virtual pages that was supposedly scan'ed by a "clock" global LRU algorithm. However there was other code that was constantly removing pages from one portion of the threaded list and re-inserting them elsewhere. The removal & re-insertion had some bad side effects ...
1) the "effective" ordering no longer was the average time since a page had its reference bit reset ... this "average time" needed to be relatively uniform for ALL pages for the "global scan" for the algorithm to approximate real LRU. a supposedly minor "implementation" change totally negated what made clock global LRU an approximation to real LRU.
2) removal and reinsertion of pages were process specific, this had the effect of repeatedly "clustering" all virtual pages for a process together ... the result would be that the "replacement algorithm" would be raiding pages from specific processes in "bursts" ... with long bursts of not raiding any pages for a specific process
3) there was very inconsistent treatment of "shared" pages which were located in virtual address space of multiple processes simultaneously. Prior to "release 3" shared library changes the maximum number of these pages were well capped (nominally 16 max). After the "release 3" changes, the maximum number of "shared" pages started to explode (and so did the side-effects of not correctly handling them).
4) various hardware technology & configuration evolutions resulted in numerous workload environments where the virtual page mean elapsed resident lifetime exceeded the mean threaded list shuffle interval. The "clock" global LRU implicitly assumes a uniform "reset interval" ... creating something of a discrimination function between 1) all pages that get used more frequently than the "reset interval" and 2) all pages that were used longer in the past than the reset interval. I came up with the "clock" global LRU implementation in the '60s specifically because it had the advantage of dynamically adapting the "reset interval" proportional to the demand for pages (w/o explicit code being required ... it also has one or two other implicit "dynamic adaptive" characteristics). The threaded list shuffle wiped out all "memory" of page reference history. When the mean virtual-page residence lifetime exceeded the mean threaded-list shuffle interval ... all relationship to a LRU-replacement algorithm disappeared.
Another example of "implementation" optimization invalidating "algorithm" architecture was the mainstream IBM operating system effort. It was no secret that the VM/370 product was viewed by "strategic" IBM as an orphan child. During the '70s, customers were frequently told that the last VM/370 release had already been shipped. Internally, people were told that if they ever wanted a career path and/or promotion that they had to transfer to the "mainstream, strategic, operating system product".
In any case, in the early '70s the "mainstream" product was preparing to embrace "virtual". The non-virtual design implemented a single real address space for all of the kernel as well as all executing processes/applications. The translation of the "real" system to "virtual" effectively created a single large simulated real address space with the virtual hardware (changing the name from MVT to SVS or "single" virtual system). There were some tricks behind the scenes to map (& fix/pin) various kernel pages to the same real address as their virtual address.
I did some consulting to them regarding LRU replacement algorithms ... however their OR simulation group "discovered" that system performance would be better if the replacement algorithm was biased towards selecting "non-changed" virtual pages prior to "changed" virtual pages. They were undeterred by arguments about the replacement algorithm no longer approximating LRU-replacement. They shipped the product with the performance improvement "discovered" by the OR simulation group. It was relatively late in the SVS product cycle that somebody observed that the "majority" of non-changed virtual pages were "shared" library code used by all applications and the "majority" of changed virtual pages were private application data pages (i.e. the implementation was biased to replace high-usage pages that were commonly referenced by all applications before private pages ... replacing high-usage shared pages had two down-sides ... individual process page fault rates ... plus tending to serialize/block multiple applications simultaneously).
The original VM/370 replacement algorithm had the "shared" page problem of not being able to discriminate between "high-usage" shared pages and "low-usage" shared pages that haven't been touched in an extended period of time. The mainstream SVS implementation went to the other extreme and was actively biased towards replacing shared pages (regardless of high or low usage).
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
From: lynn@netcom9.netcom.com (Lynn Wheeler)
Subject: Re: Schedulers
Date: Fri, 11 Mar 1994 21:28:31 GMT
Newsgroups: alt.folklore.computers

RM-PRPQ footnote:
one other feature that went out in the RM-PRPQ product was "pageable pagetables". VM/370 required 23.5 double words (188bytes) fixed real storage for every 64k bytes of virtual memory (about 12k-real/1mbyte-virtual). A hypothetical configuration with 300 processes, each with 16mbytes of virtual memory, would have required on the order of 57 mbytes of fixed real storage for tables in the standard vm/370.
benchmark footnote:
the benchmark suite started with a synthetic workload. First a large number of real live workloads were profiled ... and then some composite synthetic workloads were put together. Results from running the synthetic workloads were then cross-checked against the live-load profiles for final calibration.
An operational envelope profile was put together from data regarding "heavy load" operation of a large number of real live systems. Operating points were selected from the outer edges of the heavy-load operational envelope as well as representative (&/or common) points within the envelope. Twenty-four composite synthetic workloads were created whose execution profile matched the selected operational points ([...] utilization, resource utilization distribution were all profile factors).
A typical "validation" for some code change typically required running the complete suite of 24 operating point benchmarks with 4-5 different tuning options (i.e. 96-120 total benchmarks).
trivial response footnote:
There was a paper by some group in the late '70s claiming that they had the best performing (vm/370) timesharing service with 300+ logged on users, 100% processor utilization and 90th percentile trivial interactive response of .24 seconds.
I had a guinea pig production installation at the time with similar workload and configuration profile. The major difference was that I had deployed my high-performance page-mapped filesystem (that never was included in the product) along with several additional dispatching, scheduling, and paging enhancements (including "remembering" members of a previous working set and "block" paging when process reentered the queue). With a similar configuration and workload, this configuration had a 90th percentile trivial interactive response of .11 seconds.
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
From: lynn@netcom7.netcom.com (Lynn Wheeler)
Newsgroups: alt.hypertext,comp.infosystems.interpedia
Subject: Re: link indexes first
Date: Fri, 11 Mar 1994 19:58:03 GMT

we've looked at some "bi-directional" link challenges. One scenario is a departmental CD-ROM server. The CD-ROM can be "pressed" with an arbitrary number of bidirectional links ... but departmental "views", individual "views" and/or overlaid individual/departmental "views" require r/w update capability. We addressed the opportunity with "virtual" subjects (stored in r/w database) that had "one-way" pointer to the "real" subject (possibly located on the r/o cd-rom). Departmental, individual, and overlaid individual/departmental views involve transparently merging the "virtual" and the "real" subject along with all the associated relations.
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: alt.folklore.computers
From: lynn@netcom9.netcom.com (Lynn Wheeler)
Subject: Re: IBM 7090 (360s, 370s, apl, etc)
Organization: NETCOM On-line services
Date: Wed, 23 Mar 1994 19:27:16 GMT

For cp/67 .... csc did the original port of apl/360 to cms/apl. It included some extensions for doing "cms file i/o". Also there was a peculiar problem with apl memory garbage collection.
apl would nominally not re-use any storage location ... almost any value "modification" involved allocating the next available memory location for the new value (and ignoring the "old" location). When memory allocation reached the end of storage it would garbage collect, compress variables down to low-memory addresses and restart. csc had some extensive monitoring tools to do full instruction and storage reference monitoring (some of the technology was eventually released as the VS/Repack product in the mid-70s ... among other things, if given a load-map it would use "cluster-analysis" to do program re-arrangement for improved virtual memory operation). One of the tools could produce a printed chart of memory references (we had 6' high outputs scotch-taped together along 10-15 feet worth of wall) against time. APL operation would typically have this very pronounced saw-tooth effect with a sharp rise using all of (virtual) memory and then a straight line collapse when end-of-memory was reached and garbage collection would run.
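The saw-tooth can be reproduced with a toy allocator that follows the two rules described: every assignment takes the next free location, and compaction only happens when the end of the workspace is reached. This is an illustration, not the apl\360 storage manager; the `Workspace` class, its field names, and the sizes are hypothetical.

```python
class Workspace:
    def __init__(self, size):
        self.size = size
        self.next_free = 0       # storage is only ever handed out from here up
        self.live = {}           # variable name -> (address, length)
        self.collections = 0

    def assign(self, name, length):
        """Give name a new value; return the high-water address afterwards."""
        self.live.pop(name, None)               # old copy is simply abandoned
        if self.next_free + length > self.size:
            self.collect()                      # only at end of storage
        if self.next_free + length > self.size:
            raise MemoryError("workspace full")
        self.live[name] = (self.next_free, length)
        self.next_free += length
        return self.next_free

    def collect(self):
        """Compress live variables down to low addresses and restart."""
        address = 0
        for name, (_, length) in list(self.live.items()):
            self.live[name] = (address, length)
            address += length
        self.next_free = address
        self.collections += 1

ws = Workspace(100)
# A "program" with a single 10-unit variable that is modified 25 times.
high_water = [ws.assign("x", 10) for _ in range(25)]
```

Even though only 10 units are ever live, the high-water mark climbs to the full workspace size and then collapses, over and over, so the pages touched depend on the workspace size rather than on the size of the program.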
this wasn't too bad virtual memory characteristic with 32k-100k byte APL workspaces ... but it turned out that a lot of the people using cms/apl on the csc machine were doing it because they could get 1mbyte and larger (virtual memory) workspaces (in addition to cms file i/o). Effectively apl would utilize all available virtual memory regardless of the size of the apl application/program running. In order to handle that, csc developed an optimized virtual memory garbage collector for apl.
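The saw-tooth can be reproduced with a toy model. This is only a sketch under simple assumptions (fixed-size allocations, a constant amount of live data, and hypothetical names such as `sawtooth`); it is not the apl/360 allocator, just an illustration of "never re-use a location, compact at end of storage":

```python
def sawtooth(workspace_size, live_size, alloc_size, steps):
    """Trace the allocation pointer of an allocator that never re-uses storage.

    Hypothetical model: every value "modification" takes the next free
    location; when the workspace is exhausted, live data is compacted
    down to low addresses and allocation restarts just above it.
    """
    next_free = live_size
    trace = []
    collections = 0
    for _ in range(steps):
        if next_free + alloc_size > workspace_size:
            next_free = live_size      # garbage collect and compact
            collections += 1
        next_free += alloc_size        # new value always goes to fresh storage
        trace.append(next_free)
    return trace, collections


trace, collections = sawtooth(workspace_size=100, live_size=20,
                              alloc_size=10, steps=20)
print(trace)
print("collections:", collections)
```

The pointer climbs to the top of the workspace regardless of how little data is actually live, which is why a 1mbyte workspace touches 1mbyte of virtual pages even for a small program.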
Some of the IBM 370 machines also had an interesting virtual memory, cache implementation. The 370 architecture had a mode-bit that selected between 2k & 4k virtual pages. The dos/vs & vs1 operating systems ran with 2k virtual pages and svs/mvs ran with 4k virtual pages. vm/370 nominally ran 4k virtual memory ... but if emulating a virtual machine in relocate mode ... it would use whatever page-mode the virtual machine specified. The caches were "real-mapped", but some machines would start cache line selection using "low-bits" of the page displacement. On the 64k cache 370/168 it would use the "11th bit" (i.e. 2k) as part of the cache-line selection. However, when switching from 4k->2k virtual relocate mode ... the 168 would invalidate all cache-line entries and switch to being only a 32kbyte cache machine. Going from 2k->4k mode it would again invalidate all the cache line entries and then switch back to being a 64kbyte cache machine. There was at least one case where a dos/vs/vm "shop" upgraded from a 32kcache 168 to a 64kcache 168 only to find their performance significantly degrade.
The 370/168 had a 7 deep tlb sto (i.e. each tlb entry had a 3bit identifier, "invalid" and seven possible address-space "ownerships"). It was somewhat "tuned" for SVS/MVS. The SVS/MVS design reserved the 1st 8mbytes of virtual memory for kernel code which left the 2nd 8mbytes (in 16mbyte virtual address space) for application code. On the 370/168, one of the tlb index bits was the virtual address 24bit (8mbyte). This worked out well for SVS/MVS ... but hampered cms, dos, vs1 running on the machine since in most nominal environments, all virtual addresses were <8mbyte (with effectively no virtual addresses >8mbyte, half of the tlb entries always went unused).
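A small sketch of why the 8mbyte index bit wastes half the tlb for small address spaces. The index function here is an assumption made up for illustration (five low page-number bits plus the 8mbyte bit, 64 entries); the real 168 index wiring is not given in the post:

```python
PAGE_SHIFT = 12        # 4k pages
EIGHT_MB_BIT = 23      # address bit worth 8mbyte in a 24-bit virtual address


def tlb_index(vaddr, low_bits=5):
    """Hypothetical tlb index: low page-number bits plus the 8mbyte bit."""
    page = vaddr >> PAGE_SHIFT
    low = page & ((1 << low_bits) - 1)
    high = (vaddr >> EIGHT_MB_BIT) & 1
    return (high << low_bits) | low


def entries_used(addresses):
    """Count the distinct tlb index values a set of addresses can reach."""
    return len({tlb_index(a) for a in addresses})


below_8mb = range(0, 8 << 20, 1 << PAGE_SHIFT)
full_16mb = range(0, 16 << 20, 1 << PAGE_SHIFT)
print(entries_used(below_8mb), "of", entries_used(full_16mb))
```

With every address below 8mbyte the high index bit is always zero, so only half of the index values are ever selected, whatever the low bits are.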
When our scalable SMP projects got canned (first a 5-way and then a revived 16-way) ... we adapted the design/implementation to a standard 370 2-way (actually when the 2-way support was released as part of the base VM product, they had to do some interesting product "repackaging" since the implementation was dependent on a large part of the code in the RM PRPQ) ... we had done some optimizations for cache affinity management and pre-emptive task-switching ... which (on 2-way) resulted in situations where the "MIP" rate on one of the processors was effectively near the nominal uni-value ... but the other processor would hit a "MIP" rate nearly 50% higher. Several of the vm "performance" monitors only paid attention to % kernel cpu utilization and % process cpu utilization ... in some cases running at a 50% higher MIP rate would superficially appear as if less work was being performed.
In late '77 & early '78 I helped put together a cluster system of eight 2-way MPs (initially 168s but upgraded to 3033s) all sharing the same disk farm. It was used to provide primarily APL-based application service (with typical apl workspaces running around 1mbyte or larger) ... at the time it was the largest "single system image" operation.
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: alt.folklore.computers From: lynn@netcom9.netcom.com (Lynn Wheeler) Subject: scheduling & dynamic adaptive ... long posting warning Date: Thu, 24 Mar 1994 00:18:25 GMT
With respect to scheduling & dynamic adaptive feedback control system postings a couple weeks ago (hot button of mine) ... although not directly related to computers (might be better placed in alt.folklore.military? ... anyway) ...
Col (ret) John Boyd has had some fascinating things to say about operating inside your opponent's OODA-loop (observe, orient, decide, act; feedback loop). A lot of his thoughts about increasing feedback loop performance seemed to originate from his background as a fighter pilot during the Korean War coming right out of "plane turn radius" in dog fights. I've had the privilege of sponsoring his talks several times. I had done a lot in the late '60s and early '70s with dynamic adaptive control systems using feedback loops and operating envelopes so I was quite taken with Boyd's OODA-loop concept and plane operating envelopes.
At one time, Boyd was in charge of lightweight fighter plane R&D at the pentagon ... he had also run a "skunk-works" responsible for the early F16 design. Prior to that he had developed something called "Boyd's Laws" which he evolved into a 300 or so page fighter pilot training manual. Basically it involves looking at different planes' performance envelopes along several axes simultaneously (graphs look like lopsided circles). You overlap two of these envelope/graphs for two different planes ... and it suggests where your plane operates best vis-a-vis an opponent ... and conversely the same for them. He says that later the CIA translated a Russian fighter pilot training manual ... and it was word-for-word his document except for simple changes like feet & miles to meters & kilometers. The use of these performance envelopes then evolved into being used for plane design for things like in what areas do you want improvements ... and what areas are you willing to take sacrifices/trade-offs.
Boyd also has seemingly hours of stories about technology/science "not working" and/or at least being used incorrectly. I've suspected that he also had a hand in the F20/Tigershark (since it conformed with lots of his statements about designing a plane that had a long MTBF and a typical enlisted person could repair/service quickly, i.e. flying time was much greater than down/service time).
US News & World Report had a short article on him during Desert Storm titled "The Fight To Change How America Fights" ... also mentioning the "Jedi Knights" (6May1991). I remember a briefing (on CNN) given by some Col. two days into the war that talked about how strategy & tactics had changed ... using phrases that Boyd was using/advocating at least 10 years earlier.
Boyd had a two part talk, 1) Patterns of Conflict, and 2) Organic Design For Command and Control. Patterns of Conflict is the longer of the two talks. The last four foils list over 200 references. By comparison, Organic Design for Command and Control has less than 1/5th as many foils. While both talks draw heavily on historical examples from warfare, the real focus of the talks was fundamental principles of how to be successful in a competitive environment. Any typos in the attached excerpts are mine.
... from Patterns Of Conflict:
• Sun Tzu (around 400 BC)
Probe enemy to unmask his strengths, weaknesses, patterns of movement
and intentions. Shape enemy's perception of world to
manipulate/undermine his plans and actions. Employ Cheng/Ch'i
maneuvers to quickly and unexpectedly hurl strength against
weaknesses.
• Bourcet (1764-71)
"A plan ought to have several branches. ...One should...mislead the
enemy and make him imagine that the main effort is coming at some
other part. And...one must be ready to profit by a second or third
branch of the plan without giving one's enemy time to consider it."
• Napoleon (early 1800's)
"Strategy is the art of making use of time and space. I am less chary
of the latter than the former. Space we can recover, time never." "...I
may lose a battle, but I shall never lose a minute." "The whole art of
war consists in a well reasoned and circumspect defensive, followed by
rapid and audacious attack."
• Clausewitz (1832)
Friction (which includes the interaction of many factors, such as
uncertainty, psychological/moral forces and effects, etc.) impedes
activity. "Friction is the only concept that more or less corresponds
to the factors that distinguish real war from war on paper." In this
sense, friction represents the climate or atmosphere of war.
• Jomini (1836)
By free and rapid movements carry bulk of forces (successively)
against fractions of the enemy.
• N.B. Forrest (1860's)
"Git thar the fustest with the mostest."
• Blumentritt (1947)
"The entire operational and tactical leadership method hinged upon...
RAPID, concise assessment of situations, ...QUICK decisions and QUICK
execution", on the principle: "each minute ahead of the enemy is an
advantage."
• Balck (1980)
Emphasis upon creation of "implicit connections or bonds" based upon
"trust, not mistrust", that permit wide freedom for subordinates to
exercise initiative and imagination -- yet, harmonize within intent of
superior commanders. Benefit: internal simplicity that permits rapid
adaptability.
• Yours truly (i.e. John Boyd)
Operate inside adversary's observation-orientation-decision-action
loops to enmesh adversary in a world of uncertainty, confusion,
disorder, fear, panic, chaos, ...or fold adversary back inside
himself, so that he cannot cope with events/efforts as they unfold.
xxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxx
Quite frequently, Boyd's foils are "black". A few foils from "organic design for command & control":
foil 25
comment
-------
Up to this point we have shown orientation as being a critical element
in command and control -- implying that without orientation there is
no command and control worthy of the name.
? - raises question - ?
-----------------------
what do we mean by command and control?
foil 26
Some Historical Snapshots
------------------------
Before attempting to respond to this question let us take a look at
some evidence (provided by Martin Van Creveld as well as myself) that
may help in this regard:
• Napoleon's use of staff officers for personal reconnaissance
• Moltke's message "directives" of few words
• British tight control at the Battle of the Somme in 1916
• British GHQ "phantom" recce regiment in WW II
• Patton's "household cavalry"
• My use of "legal eagle" and comptroller at NKP
foil 27
A Richer View
-------------
(a la Martin Van Creveld -- "Command" -- 1982)
In the June 1967 War, "... General Yeshayahu Gavish spent most of his
time either 'accompanying' units down to brigade level -- by which,
according to his own definition, he meant staying at that unit's
command post and observing developments at first hand -- or else
helicoptering from one unit to another; again, in his own words,
'there is no alternative to looking into a subordinate's eyes,
listening to his tone of voice'. Other sources of information at his
disposal included the usual reporting system; a radio network linking
him with the three divisional commanders, which also served to link
those commanders with each other; a signals staff whose task it was to
listen to the divisional communications networks, working around the
clock and reporting to Gavish in writing; messages passed from the
rear, i.e., from General Headquarters in Tel Aviv, linked to Gavish by
'private' radio-telephone circuit; and the results of air
reconnaissance forwarded by the Air Force and processed by Rear
Headquarters. Gavish did not depend on these sources exclusively,
however; not only did he spend some time personally listening in to
the radio networks of subordinate units (on one occasion, Gavish says,
he was thereby able to correct an 'entirely false' impression of the
battle being formed at Brigadier Gonen's headquarters) but he also had
a 'directed telescope' in the form of elements of his staff, mounted
on half tracks, following in the wake of the two northernmost divisions
and constantly reporting on developments."
foil 28
Point
-----
The previous discussion once again reveals our old friend -- the
many-sided implicit cross-referencing process of projection,
correlation, and rejection.
? - Raises Question - ?
-----------------------
Where does this lead us?
foil 29
Epitome of "Command and Control"
--------------------------------
Nature
------
• Command and control must permit one to direct and shape what is to
be done as well as permit one to modify that direction and shaping by
assessing what is being done
What does this mean?
--------------------
• Command must give directions in terms of what is to be done in a
clear unambiguous way. In this sense, command must interact with
system to shape character or nature of that system in order to realize
what is to be done;
whereas
• Control must provide assessment of what is being done also in a
clear unambiguous way. In this sense, control must not interact nor
interfere with system but must determine (not shape) the
character/nature of what is being done
Implication
-----------
• Direction and shaping, hence "command", should be evident while
assessment and determination, hence "control", should be invisible and
should not interfere -- otherwise "command and control" does not exist
as an effective means to improve our fitness to shape and cope with
unfolding circumstances.
foil 30
Illumination
------------
• Reflection upon the statements associated with the Epitome of
"Command and Control" leave one unsettled as to the accuracy of these
statements. Why? Command, by definition, means to direct, order, or
compel while control means to regulate, restrain, or hold to a certain
standard as well as to direct or command.
• Against these standards it seems that the command and control (C&C)
we are speaking of is different than the kind that is being applied.
In this sense, the C&C we are speaking of seems more closely aligned
to 'leadership' (rather than command) and to some kind of 'monitoring'
ability (rather than control) that permits leadership to be effective.
• In other words, leadership with monitoring, rather than C&C, seems
to be a better way to cope with the multi-faceted aspects of
uncertainty, change, and stress. On the other hand, monitoring, per
se, does not appear to be an adequate substitute for control. Instead,
after some sorting and reflection, the idea of 'appreciation' seems
better. Why? First of all, appreciation includes the recognition of
worth or value and the idea of clear perception as well as the ability
to monitor. Moreover, next, it is difficult to believe that leadership
can even exist without appreciation.
• Pulling these threads together suggests that 'appreciation and
leadership' offer a more appropriate and richer means than C&C for
shaping and adapting to circumstances.
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: comp.arch From: lynn@netcom11.netcom.com (Lynn Wheeler) Subject: Re: talk to your I/O cache Date: Fri, 25 Mar 1994 16:01:26 GMT
There is another type of I/O "cache" optimization that I worked on in the late '70s. It basically was an extension of the "page migration" work that I had done earlier (moving virtual memory pages around at different levels in a disk-performance hierarchy ... analogous to file HSM migration).
In traditional "cache" operations, reads are "non-destructive". However, in situations where the I/O cache line size and read record size are the same and there is a large processor real memory cache .. then a case can be made for doing "destructive" cache reads (which effectively makes everything in the real memory cache "changed").
In the "non-destructive" read situation, there will be "duplicate" records in the I/O cache and the real memory cache. The total number of cached records is equal to the number of records in the I/O caches plus the number of records in the real memory cache MINUS the number of duplicates (i.e. the same record exists in both the I/O cache and the real memory cache).
For configurations with large real memory caches, it is possible that the number of these "duplicates" can approach the total I/O cache capacity. In such a situation the I/O cache is typically reduced to doing little more than optimization associated with rotational latency (and doesn't really need to be any larger than the number of bytes on a track). Even in that situation, some of the "hit" numbers can be deceptive i.e. a program is sequentially reading a single record at a time, and say there are 10 records per track and there is full-track buffering ... the first record read brings in the track which counts as a "miss" but the next 9 record reads will be counted as "hits" for a theoretical cache-hit ratio of 90%. In this environment, "cache" sizes larger than a track will not increase the hit-ratio beyond 90%.
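The 90% figure falls straight out of counting. A minimal sketch, assuming a single full-track buffer and strictly sequential single-record reads (the function name is made up for illustration):

```python
def sequential_hit_ratio(records_per_track, tracks_read):
    """Hit ratio for sequential single-record reads with one full-track buffer."""
    hits = misses = 0
    buffered_track = None
    for record in range(records_per_track * tracks_read):
        track = record // records_per_track
        if track == buffered_track:
            hits += 1                  # record already came in with its track
        else:
            misses += 1                # first record of a track loads the track
            buffered_track = track
    return hits / (hits + misses)


print(sequential_hit_ratio(records_per_track=10, tracks_read=5))
```

Only one buffer is ever needed here, so adding cache beyond a track's worth cannot change the count of first-record misses.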
Sorry for the side-track, back to the "dup"/"no-dup" scenario. As long as the number of "duplicates" is small with respect to the size of the I/O cache, then non-destructive reads are fine. However, when the number of duplicates approaches a significant percentage of the I/O cache size, then some optimization can be achieved by switching to "destructive" reads and a "no-dup" policy.
The page-migration scenario from the late '70s came about because of the growth in real memory sizes that approached or exceeded the size of high-speed fixed-head disk paging devices. Say that there was 128mbytes of page disk capacity and 64mbytes of real memory ... in a "dup" strategy the largest total amount of virtual pages is 128mbytes. However, with a logical "destructive" read (in the page scenario anytime a page is read into memory, its disk backing store location is deallocated, if the page is ever "replaced" in the future, it must be written) and a "no-dup" algorithm, the total amount of virtual memory increases to 192mbytes=128mbytes+64mbytes. In this scenario there is a tradeoff between I/O activity and available virtual memory capacity (note there must be at least 1 disk page slot held in reserve to avoid a deadlock scenario).
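The capacity arithmetic can be written out in page units. This is a sketch only; the 4k page size and the choice to subtract the reserved slot from the total are assumptions for illustration:

```python
PAGE = 4096  # assumed page size


def virtual_pages(disk_slots, memory_frames, no_dup, reserved_slots=1):
    """Maximum number of distinct virtual pages under "dup" and "no-dup".

    dup:    every page keeps its disk slot, so disk capacity is the limit.
    no-dup: a page lives on disk or in memory, never both, so the two add,
            less the slot(s) held in reserve to avoid deadlock.
    """
    if no_dup:
        return disk_slots - reserved_slots + memory_frames
    return disk_slots


disk_slots = (128 << 20) // PAGE       # 128mbytes of paging disk
memory_frames = (64 << 20) // PAGE     # 64mbytes of real memory
print(virtual_pages(disk_slots, memory_frames, no_dup=False))
print(virtual_pages(disk_slots, memory_frames, no_dup=True))
```

Ignoring the reserved slot, the no-dup total is 128mbytes + 64mbytes = 192mbytes.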
I/O caches represent a similar opportunity ... assuming the cache-line
size and the record transfer size are the same. There is an optimization
which requires that there be two types of "writes" tho:
1) standard write which implies place the record in the
cache as well as force it to disk
2) cache-only write ... which just places the record in
the cache ... but doesn't force it to disk (and allows it
to be discarded if selected for replacement).
as well as "destructive" and "non-destructive" reads (i.e.
non-destructive read leaves the record in the cache, destructive read
invalidates the cache line and makes the space available). Note the
cache-only write also handles the scenario of indicating to the I/O
cache that the information is already out at the specified disk record
location.
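The four operations can be sketched as a tiny interface. Everything here is hypothetical (class name, method names, a dict standing in for the disk, oldest-first replacement); it is meant only to show how the two write flavors and two read flavors fit together:

```python
from collections import OrderedDict


class IOCache:
    """Toy I/O cache with standard/cache-only writes and (non-)destructive reads."""

    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk               # dict: record id -> data (stand-in for disk)
        self.lines = OrderedDict()     # record id -> data, oldest first

    def _place(self, rec, data):
        self.lines[rec] = data
        self.lines.move_to_end(rec)
        while len(self.lines) > self.capacity:
            self.lines.popitem(last=False)   # discard the oldest line

    def write(self, rec, data):
        """Standard write: place in the cache and force to disk."""
        self.disk[rec] = data
        self._place(rec, data)

    def write_cache_only(self, rec, data):
        """Cache-only write: caller asserts the disk already holds this record."""
        self._place(rec, data)

    def read(self, rec, destructive=False):
        if rec in self.lines:
            if destructive:
                return self.lines.pop(rec)   # invalidate the line, free the space
            self.lines.move_to_end(rec)
            return self.lines[rec]
        data = self.disk[rec]
        if not destructive:
            self._place(rec, data)           # keep a copy for later reads
        return data
```

A "no-dup" caller would read destructively when bringing a record into real memory and use `write_cache_only` when giving an unchanged record back.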
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: comp.arch,alt.folklore.computers From: lynn@netcom11.netcom.com (Lynn Wheeler) Subject: lru, clock, random & dynamic adaptive Date: Fri, 25 Mar 1994 16:52:42 GMT
This is a followup to both my i/o cache posting in comp.arch and my (earlier) scheduler/dynamic adapting posting in alt.folklore.computers.
I had originated CLOCK in the late '60s ... but in the early '70s stumbled across a very interesting variation on clock. From our instruction/storage traces and replacement simulator ... like 10% of true-LRU.
However, neither clock nor true-LRU handled the case well where the working-set was slightly larger than the cache (memory) size ... or an application effectively exhibited page-reference activity much larger than total available memory. In a global environment this not only had the effect of wiping any of the local application's re-use ... but also would wipe all other application pages from memory. In effect both clock and true-LRU degenerated to FIFO under such stress conditions and had no-page-reuse characteristics.
The variation that I stumbled across in the early '70s had the characteristics of operating like clock under "normal" conditions but had the interesting characteristic of automatically "degenerating" to RANDOM (rather than FIFO) under stress conditions (the pathlength was also effectively the same as clock in UP configurations and actually better than clock in SMP configurations).
In scenarios where page-reference patterns were strictly sequential with no re-use, RANDOM, FIFO, and LRU would all perform the same. In scenarios where the page-reference patterns involved "loops" larger than available cache/memory, FIFO/LRU guaranteed that there would be no page reuse ... whereas under the same stress conditions, RANDOM allowed for a high-probability of page-reuse.
In normal scenarios, normal clock exhibits large degrees of natural dynamic adaptation to all sorts of load & configuration along with relatively short/minimal pathlength. However, it still suffers from the clock global LRU tendency to degenerate to FIFO (and no page-reuse) under stress conditions. The "adaptive-variation" clock that I stumbled across in the early '70s avoided this pitfall (and could significantly outperform true-LRU in the simulator involving "stress" workloads).
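The loop case is easy to check with a few lines of simulation. This is a minimal sketch with textbook FIFO, true-LRU and RANDOM replacement (not the clock variation described in the post); the function name and the 11-page loop over 10 frames are illustrative choices:

```python
import random
from collections import OrderedDict


def simulate(policy, frames, references, seed=0):
    """Count page hits for 'fifo', 'lru' or 'random' replacement."""
    rng = random.Random(seed)
    resident = OrderedDict()           # page -> True, oldest/least recent first
    hits = 0
    for page in references:
        if page in resident:
            hits += 1
            if policy == "lru":
                resident.move_to_end(page)   # refresh recency on a hit
            continue
        if len(resident) >= frames:
            if policy == "random":
                del resident[rng.choice(list(resident))]
            else:                            # fifo and lru evict the front entry
                resident.popitem(last=False)
        resident[page] = True
    return hits


loop = [p for _ in range(50) for p in range(11)]   # loop one page larger than memory
for policy in ("fifo", "lru", "random"):
    print(policy, simulate(policy, frames=10, references=loop))
```

On the loop, FIFO and LRU always evict exactly the page that will be needed soonest, so they never hit; RANDOM usually evicts some other page and keeps a good share of the loop resident.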
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: alt.folklore.computers From: lynn@netcom11.netcom.com (Lynn Wheeler) Subject: REXX Date: Sun, 27 Mar 1994 01:28:25 GMT
This is a REXX story from the early '80s.
In 1982 REXX was still in its early incarnations and there were efforts to get it released to the world as a product. Some of the nay-sayers were claiming that it was just another batch command language ... of which the world already had plenty. Being part of the true-believers I wanted to do a demonstration that showed that it was significantly more than another batch command language.
I selected as a demonstration a replacement of a VM product component that was currently implemented in 370 assembler. The existing product was called DUMPSCAN and it contained >20k lines of assembler code and was used to view CP and CMS postmortem storage image dumps (and had a full-time department of 5-10 people supporting it).
My demonstration was that in 3 months elapsed, working half-time, I
would create a REXX replacement for DUMPSCAN that had 5* the function
and ran 5* faster (REXX is an interpreted language). The initial part
of the demonstration was completed in a little over 2 months ... it
had a very small assembler stub module (couple hundred lines of code)
that provided some low-level primitive functions for "DUMPRX". The
actual replacement was 2200 lines of REXX code that implemented a
large superset of the DUMPSCAN function and would operate 5* faster
(a side-effect, for those familiar with the OCO issue, was that
effectively nearly all source code had to be shipped). Some of the
enhancements:
• "opcodes" formatted storage display
• display storage as addresses with respect to
kernel symbol table.
• some simple pseudo-assembler code written in REXX could
process source include files and perform "source" formatted
display of storage locations
• handle not only postmortem storage dumps but also work
against live cp & cms kernel
• parse the GML source file for messages&codes manual and
display information of interest.
• save/log complete session
• sophisticated high-level "help"
Since I still had almost a month left on the project ... I produced
nearly another 800 lines of REXX code that implemented
expert-system-like analysis of postmortem storage images.
It became relatively successful ... although never released to customers as a product. I directly distributed the application to over 100 internal locations world-wide and at least at one time was in use by all internal locations as well as all (VM) field service people.
In support of this, I also made a minor modification to the CP kernel to maintain a symbol entry-point table. The "standard" DUMPSCAN process was to merge a "saved" loadmap (generated when the kernel was built) with the dump storage image.
The CP kernel build process (not really changed since 1967) was to use the "BPS" loader to load into memory all the kernel binaries. Then a kernel application would receive control from the BPS-loader and write the storage-image to a special disk boot-location.
In 1969, I started playing around with some enhancements to the CP kernel to allow part of it to be "unpinned" and allow it to page. As part of this I also modified the boot-build routine. When the BPS-loader exits to the loaded program, it passes the address of the loader symbol table as well as the count of entries in the table. The modification I made to the boot-build routine was to copy the BPS-loader symbol table to the end of the kernel core image and write it out as part of the boot-image. Since it was located in the "pageable", unpinned portion of the kernel it wasn't going to take up runtime storage (there were some 360/67s out there that only had 512kbytes of real storage ... fixed kernel size was becoming critical).
In any case, I went back and resurrected the 1969 modifications and re-applied them to the '82 cp kernel (appending the BPS-loader symbol table to the end of the CP kernel boot-image). I then added the appropriate switch to DUMPRX to utilize the real symbol table if available. This eliminated the problem of getting entry symbols from a "load-map" that didn't match the boot-file.
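Displaying storage addresses relative to a kernel symbol table comes down to a nearest-lower-entry lookup. A minimal sketch in Python rather than REXX, with a made-up table layout and made-up entry names (`MODA`, `MODB`); the actual DUMPRX code and BPS-loader table format are not shown in the post:

```python
import bisect


def nearest_symbol(symbols, address):
    """Format an address as symbol+hex-offset.

    symbols: list of (entry_address, name) tuples sorted by address
             (a hypothetical layout, not the BPS-loader table format).
    Returns None when the address lies below the first entry point.
    """
    addresses = [entry for entry, _ in symbols]
    i = bisect.bisect_right(addresses, address) - 1
    if i < 0:
        return None
    base, name = symbols[i]
    return f"{name}+{address - base:X}"


table = [(0x1000, "MODA"), (0x2400, "MODB")]   # hypothetical entry points
print(nearest_symbol(table, 0x2410))
```

Because the table travels with the boot-image, the lookup always matches the kernel that produced the dump, which is the point of appending it rather than relying on a separately saved load-map.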
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: alt.folklore.computers From: lynn@netcom11.netcom.com (Lynn Wheeler) Subject: 360 "OS" & "TSS" assemblers Date: Sun, 27 Mar 1994 17:04:33 GMT
CP, CMS, and much of CMS application code was written in 360/370 assembler for both CP/67 (360) and VM/370 (and used a ported version of the mainstream "OS" assembler ... in fact, when our location received its first version of CP/67, the CP source was still being assembled and built on OS PCP).
In '74, I had written a PLI program to analyse 370 assembler listings. One of my pet peeves at the time was system failures involving "uninitialized" address registers. The PLI program parsed the assembler listing into individual machine instructions and interpreted the instructions ... including register load, store, & ref'ed activities.
The analysis created "simulated" code blocks from the parsed listing information. A "code block" was defined as a set of sequentially executed instructions which was started by a non-branch that was the next instruction after a branch instruction (branch target or fall-thru). A code block was terminated by a branch-instruction or because the following instruction was the target of a branch instruction. For each code-block a register usage map was created that showed:
The analysis code would then follow all possible paths through the code blocks, creating summary register activity maps for each path. It would also identify "dead-code" (code-blocks that were never gotten to).
There were numerous possible assembler coding techniques that the code-block building couldn't handle ... but these were relatively rare in the CP and CMS routines (and I handled them by "fixing" the code and then rerunning). Most of the incorrect handling would result in mis-identifying sections as "dead-code".
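The code-block splitting described above is what compilers usually call basic-block construction. A minimal sketch, in Python rather than PLI, assuming each parsed instruction has already been reduced to a hypothetical `(address, is_branch, target)` tuple; here a branch that directly follows another branch simply becomes a one-instruction block:

```python
def split_code_blocks(instructions):
    """Split a list of (address, is_branch, target) tuples into code blocks.

    A block ends at a branch instruction, or just before an instruction
    that is the target of some branch. Returns lists of addresses.
    """
    targets = {t for _, is_branch, t in instructions
               if is_branch and t is not None}
    blocks, current = [], []
    for address, is_branch, _ in instructions:
        if current and address in targets:
            blocks.append(current)     # next instruction is a branch target
            current = []
        current.append(address)
        if is_branch:
            blocks.append(current)     # a branch terminates its block
            current = []
    if current:
        blocks.append(current)
    return blocks


listing = [
    (0, False, None), (4, False, None), (8, True, 16),
    (12, False, None), (16, False, None), (20, True, 0),
]
print(split_code_blocks(listing))
```

Following the successors of each block (branch target and fall-thru) from the entry point then gives the path enumeration; any block never reached that way is a dead-code candidate.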
Post-processing involved generating a pseudo-code representation of the original assembler program (looked something like C). If/then/else/while/until/etc. structures could effectively be determined directly from the code-blocks. The post-processing could handle nested conditional structures to any depth, but in most cases going more than 4-5 deep resulted in less-readable ... rather than more-readable code.
Except for the conditional control structures, it was somewhat a trivial one-for-one translation between a machine-op instruction and something that looked like pseudo code. It did attempt to maintain a stack of operations for each register and defer generating the pseudo-code. For instance a load-a/add-b/store-a could turn into a+=b ... rather than r=a/r=r+b/a=r.
The hard part on the pseudo-code generation was making sure the symbolics were correct. A standard 360 assembler output line looked something like:
    address  instruction  addr1  addr2  original statement
    yyyy     ooiiiiii     a1a1   a2a2   a b c d e f g h g
Giving the relative address of the instruction in the program, the actual instruction (360/370 instructions could be 2bytes, 4bytes, or 6bytes), and the addresses of locations used by the instructions (360/370 instructions could be register/register, register/storage, or storage/storage, i.e. 0, 1, or 2 storage location addresses). The parsing of the address fields and the original statement in an attempt to formulate a reasonable pseudo-instruction was somewhat problematical.
All 360/370 instruction storage addresses are displacements with respect to value in an address register. Things to be addressed are defined symbolically to the assembler with something called "csects" (typically program code) and "dsects" (typically include data structure definitions). The assembler is informed about possible address register contents with a "using" statement. The fields "addr1" and "addr2" are (non-displacement) addresses within some csect or dsect structure. The problem was that which csect/dsect the address came from was not identified.
The previous output format was for the line of mainline "OS" 360/370 assemblers. However, generating pseudo-code from the "TSS" (time-sharing system, the "official" operating system support for the 360/67) assembler was less problematical. The TSS assembler would prefix the addr1 and addr2 fields with a "space" identifier which uniquely mapped to a specific csect/dsect. As long as a reasonable convention of symbolic field usage was followed, the TSS assembler-output eliminated ambiguities as to what to generate for symbolic parameter names in the pseudo-code.
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: comp.arch From: lynn@netcom7.netcom.com (Lynn Wheeler) Subject: Re: talk to your I/O cache Organization: NETCOM On-line services Date: Sun, 27 Mar 1994 20:19:25 GMT
note that the architecture can somewhat be reduced to that of a multi-level storage implementation (real storage, I/O caches, and storage). A manager in the processor would have the capability of specifying logical "copy" operations (read from disk, thru cache, write to disk thru cache) or logical "move" operations (discard copy once transfer is complete).
Transfer operations would have flavors of copy/move as pertaining to specific levels in the storage hierarchy (i.e. like read from disk or cache and discard cache copy).
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
From: lynn@netcom9.netcom.com (Lynn Wheeler) Newsgroups: comp.arch,alt.folklore.computers Subject: Re: lru, clock, random & dynamic adaptive ... addenda Date: Sun, 27 Mar 1994 23:53:20 GMT
in response to a number of inquiries regarding additional details:
global LRU work i did as an undergraduate in late '60s was basically a 1bit, 1hand clock. work in the 70-72 time-frame involved
>1bit
1hand & 2hand clock
variation on clock ... let's call it clock-v
since the hardware only had 1bit, additional 8bits were simulated in software (1 byte per real page).
clock-v variation required at least 2bits ... although greater than 2bits didn't seem to make much difference.
the csc simulator could handle a number of different algorithms (for which it was calibrated against live implementations) as well as STRICT/TRUE LRU (i.e. maintain strict lru page ordering ... clock only approximates strict page ordering).
both 1hand & 2hand clock-v had the characteristic of degenerating to random under stress (compared to standard clock global LRU that effectively reduces to FIFO). Under typical conditions, 1-hand clock and 1-hand clock-v were nearly identical (i.e. somewhat worse than strict, true LRU). In addition (at least in the simulator), it turned out that for any given load ... it seemed like it was always possible to find a variation on 2hand clock-v that would outperform strict/true LRU (even under normal loads).
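For reference, a minimal sketch of a plain 1bit, 1hand clock as it is usually described in textbooks. This is not clock-v (whose details are not given here) and not the original CP/67 code; the class and method names are made up:

```python
class Clock:
    """Textbook 1-bit, 1-hand clock page replacement."""

    def __init__(self, frames):
        self.pages = [None] * frames   # page held by each frame
        self.ref = [False] * frames    # one reference bit per frame
        self.where = {}                # page -> frame number
        self.hand = 0

    def access(self, page):
        """Reference a page; return True on a hit, False on a fault."""
        slot = self.where.get(page)
        if slot is not None:
            self.ref[slot] = True
            return True
        # sweep: reset reference bits until an unreferenced frame is found
        while self.ref[self.hand]:
            self.ref[self.hand] = False
            self.hand = (self.hand + 1) % len(self.pages)
        victim = self.pages[self.hand]
        if victim is not None:
            del self.where[victim]
        self.pages[self.hand] = page
        self.ref[self.hand] = True
        self.where[page] = self.hand
        self.hand = (self.hand + 1) % len(self.pages)
        return False


clock = Clock(frames=3)
loop = [p for _ in range(10) for p in range(4)]    # loop one page larger than memory
print(sum(clock.access(p) for p in loop))
```

Run against a loop one page larger than memory, the hand ends up replacing frames in strict arrival order, which is the FIFO degeneration under stress described above.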
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: alt.folklore.computers From: lynn@netcom7.netcom.com (Lynn Wheeler) Subject: cp disk story Organization: NETCOM On-line services Date: Mon, 28 Mar 1994 00:59:34 GMT Lines: 36
About the time I was working on the eight 2-way cluster activity (late '70s) .. I was also playing around over in the disk engineering & test labs.
The disk & test labs split everything into "test-cells" ... a typical lab room would have multiple test-cells and one or more mainframe processors. Operation of the hardware in a test-cell was either totally stand-alone ... or would be cabled (one at a time) to a mainframe and custom test software would be executed (typically little more than BPS system with some custom I/O exercise software).
Typically a single test-cell configuration would generate a quantity and/or severity of errors that would crash &/or hang any of the standard operating systems within 15-30 minutes.
As a hack, I redid the CP I/O subsystem to be bullet-proof so that multiple test-cells could be attached and operated concurrently. Over time, the engineering and test labs also migrated much of their time-sharing and IS processing to these processor complexes.
There was one weekend where some of the test-lab people thot that they had an almost ready new disk controller. They swapped the controller for one that handled a string of 16 disk drives used for standard time-sharing service. Late Monday morning I was getting calls asking what had I done to their system that resulted in a several hundred percent performance degradation. There had been no software changes over that weekend ... the only change had been the controller swap. It turns out that the almost ready controller had a peculiar bug that prevented it from efficiently handling lots of concurrent I/O activity. Normal mid-morning load would have resulted in concurrent I/O on all 16 drives.
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: comp.arch.storage From: lynn@netcom9.netcom.com (Lynn Wheeler) Subject: Re: Dual-ported disks? Date: Wed, 30 Mar 1994 05:20:23 GMT
a couple of minor points:
• bsd tahoe&reno (at least) have an interesting feature mapping ip->MACaddress: after return from calling the arp cache routine, the ip address and the MACaddress are saved. Next time in, if the ip address matches the saved address ... the call to arp cache lookup is bypassed. For some (possibly pathological) cases, say involving a client communicating exclusively for extended periods with a single server, the IP-address won't change. arp cache time-out isn't sufficient. one possible workaround is periodically pinging two different ip-addresses (to reset the arp-cache call). In effect, there is a "hidden" single-entry arp-cache value that doesn't conform to the arp-cache rules.
• no single point of failure disk situation ... to handle this effectively, two disks with mirrored data and (at least) two controllers (if not 4) are required. MTBF is typically much less for rotating mechanical media so redundant disks are more of an issue than controller electronics for handling various failure modes and affecting aggregate system MTBF. dual-disk controllers are for the same machine ... but so that the same disk can be attached to different machines (handling software, processor complex failures). Handling disk failures requires mirrored disks (or no-single-point-of-failure RAID attachments) each with their own pair of controllers.
• no single point of failure also needs to handle some pathological conditions. one of the best-known is the "stalled-processor" scenario. there is some protocol that all processors agree on that is used to establish the "right to do a disk write" ... one of the processors obtains the "disk write privilege" and stalls. The other processors in the complex decide that the processor is dead and reconfigure the complex (backing out and removing the "dead" processor from the configuration). The "stalled" processor comes back to life and attempts to finish the write operation. Handling this scenario requires more than just a loosely-coupled distributed protocol and a reconfiguration protocol ... it also requires a "fencing" mechanism that is free of race conditions.
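The first point in the list above (the "hidden" single-entry value in front of the arp cache) can be sketched as follows. This is a Python toy model, not the BSD code; `ArpCache`, `Sender`, the resolver callback and the time-out handling are all assumptions made up for illustration.

```python
class ArpCache:
    """Stand-in for the real ARP cache: entries expire after `ttl` seconds."""

    def __init__(self, resolver, ttl):
        self.resolver, self.ttl, self.entries = resolver, ttl, {}
        self.lookups = 0

    def lookup(self, ip, now):
        self.lookups += 1
        hit = self.entries.get(ip)
        if hit is None or now - hit[1] > self.ttl:
            hit = (self.resolver(ip), now)     # re-resolve after time-out
            self.entries[ip] = hit
        return hit[0]


class Sender:
    """Keeps a one-entry shortcut in front of the ARP cache (the 'hidden' entry)."""

    def __init__(self, arp):
        self.arp = arp
        self.last_ip = None
        self.last_mac = None

    def mac_for(self, ip, now):
        if ip == self.last_ip:
            return self.last_mac               # bypasses the cache and its time-out
        self.last_ip, self.last_mac = ip, self.arp.lookup(ip, now)
        return self.last_mac
```

As long as every packet goes to the same address, `mac_for` never reaches `ArpCache.lookup`, so the time-out never gets a chance to fire; sending to a second address replaces the saved pair and forces the next call back through the cache.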
these are fairly easy ... let's try a HIPPI or FCS switch configuration. It may require two such switches for no-single-point-of-failure. It also requires mirrored disk or no-single-point-of-failure RAID controllers that are at least dual-ported (one to each switch). Two processors each thinking the other has died ... both of them attempt to reconfigure, and the very first thing in reconfiguration is to fence out the (other) "failed" processor. The fencing must be done at both switches and must be done in such a way that it is race-condition and deadlock free.
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
From: lynn@netcom11.netcom.com (Lynn Wheeler) Newsgroups: comp.arch.storage Subject: Re: Dual-ported disks? Date: Wed, 30 Mar 1994 15:02:31 GMT
for a 2-way solution ... various disk "reserve" commands fence out the other processor ... but the architecture doesn't scale ...
the typical device "reserve" semantics say lock-out everybody but "me". The "fencing" semantics require that only the "presumed" failed processor(s) be fenced. For 2-way the effects are the same. For >2-way, reserve and fencing don't have the same semantics.
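The difference between the two semantics is easy to state as set operations. A small Python sketch (the function names are made up for illustration; each returns the set of processors that still have access to the device):

```python
def reserve(members, me):
    """Device 'reserve': lock out everybody but the caller."""
    return {me}


def fence(members, presumed_failed):
    """Fencing: lock out only the presumed-failed processor(s)."""
    return set(members) - set(presumed_failed)


two_way = {"A", "B"}
four_way = {"A", "B", "C", "D"}

# 2-way: A reserving and A fencing B leave the same processor with access.
same_for_two = reserve(two_way, "A") == fence(two_way, {"B"})

# 4-way: reserve also locks out the healthy C and D, fencing does not.
after_reserve = reserve(four_way, "A")
after_fence = fence(four_way, {"B"})
```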
I attended some HIPPI meetings in the late '80s (before DEC complained about its name and got it changed) advocating fencing in the switch architecture. They were going to do it ... but I haven't followed that in a long time. With pairs of such switches it is a little bit harder to make sure the configuration is "race" free. FCS is another matter. Anybody know if fencing also got into the FCS switch?
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: alt.folklore.computers From: lynn@netcom9.netcom.com (Lynn Wheeler) Subject: CP/67 & OS MFT14 Organization: NETCOM On-line services Date: Sun, 3 Apr 1994 17:51:11 GMT
In response to various inquiries, attached is the report that I presented at the fall '68 SHARE meeting (Atlantic City?). CSC had installed CP/67 at our university in January '68. We were then part of the CP/67 "announcement" that went on at the spring '68 SHARE meeting (in Houston).
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
OS Performance Studies With CP/67

OS            MFT 14, OS nucleus with 100 entry trace table, 105 record
              in-core job queue, default IBM in-core modules, nucleus
              total size 82k, job scheduler 100k.

HASP          118k Hasp with 1/3 2314 track buffering

Job Stream    25 FORTG compiles

Bare machine  Time to run: 322 sec. (12.9 sec/job)
   times      Time to run just JCL for above: 292 sec. (11.7 sec/job)

Orig. CP/67   Time to run: 856 sec. (34.2 sec/job)
   times      Time to run just JCL for above: 787 sec. (31.5 sec/job)

Ratio CP/67 to bare machine

   2.65       Run FORTG compiles
   2.7        to run just JCL
   2.2        Total time less JCL time

1 user, OS on with all of core available less CP/67 program.

Note: No jobs run with the original CP/67 had ratio times higher than the job scheduler. For example, the same 25 jobs were run under WATFOR, where they were compiled and executed. Bare machine time was 20 secs., CP/67 time was 44 sec. or a ratio of 2.2. Subtracting 11.7 sec. for bare machine time and 31.5 for CP/67 time, a ratio for WATFOR less job scheduler time was 1.5.

I hand built the OS MFT system with careful ordering of cards in the stage-two sysgen to optimize placement of data sets, and members in SYS1.LINKLIB and SYS1.SVCLIB.

MODIFIED CP/67

OS run with one other user. The other user was not active, was just available to control amount of core used by OS. The following table gives core available to OS, execution time and execution time ratio for the 25 FORTG compiles.

CORE (pages)    OS with Hasp        OS w/o HASP

104             1.35 (435 sec)
 94             1.37 (445 sec)
 74             1.38 (450 sec)      1.49 (480 sec)
 64             1.89 (610 sec)      1.49 (480 sec)
 54             2.32 (750 sec)      1.81 (585 sec)
 44             4.53 (1450 sec)     1.96 (630 sec)

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
MISC. footnotes:
I had started doing hand-built "in-queue" SYSGENs starting with MFT11. I would manually break all the stage2 SYSGEN steps into individual components, provide "JOB" cards for each step and then effectively run the "stand-alone" stage2 SYSGEN in the standard, production job-queue.
I would also carefully reorder the steps/jobs in stage2 (as well as reordering MOVE/COPY statements for PDS member order/placement) so as to appropriately place data on disk for optimal disk arm-seek performance.
In the report above, the "bare-machine" time of 12.9 sec/job was typically over 30 seconds/job for a MFT14 built using the standard "stand-alone" SYSGEN process (effectively an increase in arm-seek elapsed time). Also, the standard OS "fix/maintenance" process involved replacing PDS-members, which resulted in destroying the careful member placement. Even with an optimally built system, "six months" of OS maintenance would result in performance degrading to over 20 secs/job.
A non-optimally built OS system actually would make CP/67 performance look "better" (i.e. the ratio of CP/67 times to "bare-machine" times). CP/67 overhead (elapsed time increase) was proportional to simulation activity for various "kernel" activities going on in the virtual machine. I/O elapsed time was not affected by running under CP/67. Keeping the simulation overhead fixed, but doubling (or tripling) the elapsed time with longer I/O service time, would improve the CP/67/bare-machine ratios.
The modified CP/67 was based on numerous pathlength performance changes that I had done between Jan of 1968 and Sept of 1968, i.e. reduce CP/67 elapsed time from 856 sec. to 435 secs (reduction in CP/67 pathlength CPU cycles from 534secs to 113secs).
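The arithmetic behind the last two footnotes can be checked directly from the report's numbers. A small Python sketch; the 750 sec figure is an assumption for illustration (25 jobs at the "over 30 seconds/job" of a standard-SYSGEN system), and treating the overhead as a fixed add-on is the footnote's own simplification.

```python
BARE = 322        # sec, 25 FORTG compiles on the bare machine
ORIG_CP = 856     # sec, same job stream under the original CP/67
MOD_CP = 435      # sec, under the modified CP/67 (104 pages, with HASP)


def cp_overhead(elapsed_under_cp, bare=BARE):
    """Elapsed time added by running under CP/67."""
    return elapsed_under_cp - bare


def ratio(overhead, bare):
    """CP/67-to-bare-machine ratio for a fixed overhead."""
    return (bare + overhead) / bare


orig_overhead = cp_overhead(ORIG_CP)             # 856 - 322
mod_overhead = cp_overhead(MOD_CP)               # 435 - 322

tuned_ratio = ratio(orig_overhead, BARE)         # optimally built OS
untuned_ratio = ratio(orig_overhead, 25 * 30)    # assumed 30 sec/job bare machine
```

The same overhead spread over a slower (seek-bound) bare-machine run gives a smaller ratio, which is why a badly built OS made CP/67 look "better".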
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: comp.arch.storage From: lynn@netcom9.netcom.com (Lynn Wheeler) Subject: Re: Dual-ported disks? Date: Sun, 3 Apr 1994 18:06:30 GMT
a minor example regarding clusters & availability was a hypothetical situation involving 1 minute of down-time per year.
various hardware fault-tolerant solutions would provide the hardware availability, but the systems investigated had done nothing about some mundane aspects of system operation. One was installing a new version of the operating system ... requiring a minimum of a 1 hour outage. With a system upgrade on the order of one per year, each year used up the equivalent of 60 years of the 1 minute downtime budget.
clusters handled the opportunity by each processor complex having its own private system disks. individual processors could be removed from the complex (w/o taking down the service) and upgraded, tested and then restored to production operation.
at least in that sort of distributed, server environment, cluster operation could mask both hardware and software outages (scheduled and unscheduled).
To the extent that the fault-tolerant vendors invest in handling/masking the software downtime scenario ... they are also providing the ability for software to mask hardware failures.
Clustering isn't trivial SMOP ... but neither has been the redundant array of inexpensive disk efforts ... but the benefits are similar.
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: alt.folklore.computers,comp.arch From: lynn@netcom11.netcom.com (Lynn Wheeler) Subject: Re: CP/67 & OS MFT14 Date: Sun, 3 Apr 1994 19:45:58 GMT
note that in later years, in order to emulate/virtualize various "relocate" operating systems (DOS/VS, VS1, SVS, MVS, etc), CP had to (effectively) emulate the TLB (translation lookaside buffer). CP had two basic components ... 1) virtual machine real hardware emulation 2) shared resource management. For #1, CP had to implement a software analog of many hardware states/functions/capabilities.
Virtualizing a "relocate" operating system provided an interesting challenge. All the state emulation was relatively step-by-step and straightforward.
However, LRU replacement algorithms presented an interesting challenge. Many operating systems implement various types of LRU virtual page replacement algorithms based on some observed generalizations about program behavior ... i.e. pages that haven't been used in a long while are least likely to be used in the near future.
A virtualized, relocate operating system however can easily violate that premise. From a virtual relocate operating system standpoint, what it believes to be "real memory" is actually what CP manages as virtual memory. While the virtual relocate OS is running thru its "real memory" looking for virtual pages to replace and re-use the "real page" location ... it is actually running thru CP's virtual memory, specifically searching for the least used page to be the very next page to use.
In effect, a virtualized LRU page replacement algorithm exhibits behavior the opposite of the underlying LRU assumptions ... instead of the least-used page being the least likely to be used next ... it is the one that is most likely to be used next.
Therefore LRU page replacement algorithms don't recurse (&/or virtualize) gracefully (a virtualized LRU page replacement algorithm violates underlying basic program execution & page-reference pattern assumptions).
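A toy simulation shows the effect. The Python sketch below is an illustration under simplifying assumptions (strict LRU at both levels, a made-up workload of four hot pages plus a stream of new pages, hypothetical names throughout), not a model of CP itself: the guest runs LRU over its 8 "real" frames, while the host runs LRU over those same frames with fewer slots.

```python
from collections import OrderedDict


class LRU:
    """Strict LRU over a fixed number of slots."""

    def __init__(self, size):
        self.size, self.slots, self.faults = size, OrderedDict(), 0

    def touch(self, key):
        """Reference `key`; return the evicted key, if any."""
        if key in self.slots:
            self.slots.move_to_end(key)           # most recently used at the end
            return None
        self.faults += 1
        evicted = None
        if len(self.slots) >= self.size:
            evicted, _ = self.slots.popitem(last=False)   # least recently used
        self.slots[key] = True
        return evicted


def run(guest_frames, host_frames, refs):
    """Return (guest faults, host faults) for a guest page-reference string."""
    guest, host = LRU(guest_frames), LRU(host_frames)
    frame_of = {}                                 # guest virtual page -> guest "real" frame
    free = list(range(guest_frames))
    for vpage in refs:
        evicted = guest.touch(vpage)
        if vpage not in frame_of:
            # Guest fault: reuse the frame of the guest's least-recently-used page.
            frame_of[vpage] = frame_of.pop(evicted) if evicted is not None else free.pop(0)
        host.touch(frame_of[vpage])               # the host only sees frame references
    return guest.faults, host.faults


refs = []
for i in range(50):
    refs += ["h0", "h1", "h2", "h3", ("c", i)]    # 4 hot pages, then one new page

squeezed = run(8, 6, refs)    # host holds only 6 of the guest's 8 frames
resident = run(8, 8, refs)    # host holds all 8 frames
```

In the squeezed case, every frame the guest picks for reuse is one the host has already paged out, so each guest replacement costs a host fault as well.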
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: alt.folklore.computers From: lynn@netcom9.netcom.com (Lynn Wheeler) Subject: 370 ECPS VM microcode assist Date: Sun, 3 Apr 1994 22:55:56 GMT
In May of '75, some people from the Endicott programming lab came to Cambridge looking for advice as to microcode acceleration for a new machine they were building. They had some available microcode store on the machine and they were looking at things to "sink-into" the hardware to improve system performance. A likely candidate was kernel pathlengths.
I got together with Bob Creasy and we built an instrumented kernel and ran various tests ... accumulating the following profile as to the (then) kernel pathlength behavior.
The instrumentation inserted events to create time-stamp records at various points in the code. At the start of the benchmark, the time-stamp process was looped 10,000 times to calibrate the time-stamping overhead.
The resulting data was reduced pairing up various time-stamp records to account for functional elapsed time between the time-stamps (minus the fixed overhead of doing the time-stamp).
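The calibrate-then-reduce procedure described above can be sketched in Python. This is only an illustration of the idea (the original was instrumented 370 kernel code); the function names and the record format of `(label, timestamp)` tuples are assumptions.

```python
import time


def calibrate(loops=10_000):
    """Average cost in ns of taking one time-stamp, measured by looping."""
    start = time.perf_counter_ns()
    for _ in range(loops):
        time.perf_counter_ns()
    return (time.perf_counter_ns() - start) / loops


def reduce_pairs(records, overhead):
    """Pair consecutive (label, timestamp) records into per-path counts and times."""
    totals = {}
    for (a, t0), (b, t1) in zip(records, records[1:]):
        count, elapsed = totals.get((a, b), (0, 0.0))
        # Charge the interval to the path a -> b, minus the time-stamp overhead.
        totals[(a, b)] = (count + 1, elapsed + (t1 - t0) - overhead)
    return totals


# Made-up records in arbitrary time units, with a made-up overhead of 10.
records = [("dsp+4", 0), ("dsp+214", 110), ("dsp+4", 300), ("dsp+214", 420)]
profile = reduce_pairs(records, overhead=10)
```

Each key of `profile` is a path between two instrumentation points; dividing the accumulated time by the count gives the kind of per-path figure that appears in a table of kernel pathlengths.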
The following are the results from that first run, which were provided to Endicott for selecting kernel functions for migration to "hardware"; there were 6000 bytes of microcode space available for sinking CP kernel function into the hardware. The "79.55" accumulated percentage represents the approximate equivalent of 6000 bytes of 370 machine code.
"path" is the three-character module name (w/o the "DMK" prefix) and the byte displacement within the module.
path                       count    time    percent
                                    (mics)  cp

dsp+8d2 to dsp+c84         67488    374.    9.75    from 'unstio' end to enter problem state
prg+56 to prv+46           69848    232     6.27    from prog. interrupt to priv. simulation
ccw+33e to ccw+33e         64868    215     5.38    loop in ccw calling page lock
fre+5a8                    73628    132     3.77    'FRET'
ccw + f4 to ccw +33e       45297    213     3.73    from initial 'FREE' call to page lock call
dsp+4 to dsp+214           84674    110     3.61    main entry to start of 'unstio'
ptr+a30                    124502   75      3.59    unlock page
ccw + 33e to '3'           44839    207     3.58    from lock page to ticscan return
ios+20                     19399    474     3.55    dmkiosqv (before alternate path finding)
fre+8                      73699    122     3.47    FREE
IOS+1c2 to DSP+4           27806    208     2.23    call SCN(real) until DSP entry (after I/O int)
dsp+4 to dsp+c84           15105    374     2.18    asysvm entry until enter prob state
sch+4                      23445    221     2.00
ios+108 to ios+1c2         27952    165     1.78    I/O interrupt to call scn(real)
scn+84                     84359    54      1.76
dsp+93a to dsp+c84         11170    374     1.62    sch call to entry problem mode
prv+46 to dsp+b8           20976    199     1.61    non-I/O priv. instruction to new psw DSP entry
ccw+1252 to EXIT           26212    156     1.58    ticscan return to exit
vio+13a to ccw+0           19405    191     1.43    v.sio, ioblok free call until ccwtran call
vio+1d0 to ios+20          19399    181     1.36    ccwtran return to DMKIOSQV call
ios+0                      8423     416     1.35    DMKIOSQR
vio+3e to VIO+13a          19405    169     1.27    vio entry(for sio) to 'FREE' call
dsp+214 to dsp+8d2         70058    45.     1.21    'unstio' with no calls
vio+992 to unt+5a          19410    157.    1.17
ccw+28a to fa (via FREE)   26140    107     1.08    ticscan return till loop back for next block
unt+9e to 116 (FRET)       44694    60.     1.03
unt+9e to 9e (PTR+A30)     65092    38      .97     (79.55 cumm.)
unt+116 to exit            19407    118     .89     from FRET call to EXIT
vio+4 to 3e (SCN+84)       45240    49      .86     vio entry until scan call for vdevblok
vio+3e to dsp+4            25504    86.     .685    from SCN call to DSP (non-SIO)
SCN+4                      27979    69      .75     real I/O scan (most IOS+1c2)
dsp+214 to 4ce (SCN+84)    14637    126.    .72     'unstio' until scn call
--
Newsgroups: alt.folklore.computers From: lynn@netcom11.netcom.com (Lynn Wheeler) Subject: CP spooling & programming technology Date: Tue, 5 Apr 1994 16:31:59 GMT
This is part of a CFP announcement that I broadcast in Dec, 1981 for an advanced technology conference (that I ran in March 1982):
TOPICS
• High level system programming language
• Software development tools
• Distributed software development
• Migration of CP functions to virtual address spaces
• Migration to non-370 architectures
• 370 simulators
• Dedicated, end-user system
The objective of the conference was to address the rate at which the existing product could adapt to hardware and other environmental changes ... i.e. the technology rate of change was increasing and the software technology was not able to track that rate of change, nor the increases in the rate of change (right out of Boyd's OODA-loop).
In some sense this was an attempt to respond to the "UNIX" opportunity. At the time (and to some extent still), the UNIX operating system wasn't competitive other than in its ability to adapt to different & changing environments (hardware, architectures, requirements, etc). The conference included a UNIX paper.
At the time, we had a dearth of advanced technology conferences. The prior one had been held six years previously. At that conference our 16-way MP project was on the agenda as well as the "801" project.
The period between the CFP and the conference was also the period during which I was conducting the REXX/DUMPSCAN demonstration (see prior posting).
The "migration of CP functions to virtual address spaces" was effectively to restructure CP into even more of a micro-kernel than it already was. At the time, the CP kernel consisted of around 190 source modules and 250k machine instructions ... all operating within a single protection domain.
One of the "migration" demonstrations that I did was the CP spooling function ... re-implementing it in PASCAL (I didn't have a 370 C compiler at the time), migrating most of the function to a virtual address space, and extending/improving a lot of the function.
During the early to mid-80s, I was also running a skunk-works project that I called HSDT (high-speed data transport). HSDT had a deployed pilot with a number of high-speed terrestrial and satellite links (HSDT included designing/deploying a high-speed digital TDMA satellite system and a double-hop digital broadcast system). For the project I also implemented various drivers for both "bitnet" and "ip" protocols.
For "bitnet" throughput, the CP spooling system represented a serious performance limitation when driving high-speed links (or even driving lots of low-speed links). The "bitnet" file transfer protocol was store-and-forward with nodes using the CP spooling system for intermediate storage. The CP spooling system used synchronous, per process serialized 4kbyte-block transfer semantics. The "bitnet" protocol operated as a single process (independent of the number of links being driven). Under heavy load, the spooling interface might limit the "bitnet" process to 15 4kbyte-block transfers/second (along with holding the "bitnet" process the majority of the time in blocked state). Also, because of the spooling system's transaction logging at "file" boundaries, if lots of "small" files were being processed, the throughput would drop even lower.
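To put the 15 blocks/second ceiling in perspective, it can be converted to a line rate. A Python sketch; taking "4kbyte" as 4096 bytes and comparing against the nominal T1 rate of 1.544 Mbit/s are assumptions made for the illustration.

```python
BLOCK_BYTES = 4096               # assumed size of a "4kbyte" spool block
T1_BITS_PER_SEC = 1_544_000      # nominal T1 line rate


def spool_bits_per_sec(blocks_per_sec, block_bytes=BLOCK_BYTES):
    """Aggregate data rate the spooling interface can sustain."""
    return blocks_per_sec * block_bytes * 8


ceiling = spool_bits_per_sec(15)
fraction_of_t1 = ceiling / T1_BITS_PER_SEC
```

Under those assumptions the whole "bitnet" process, regardless of how many links it drives, tops out at roughly a third of a single T1.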
The "rewritten" spooling subsystem eliminated nearly all of the throughput bottlenecks, providing asynchronous interface semantics along with read-ahead, write-behind, and contiguous block allocation (along with multi-block reads/writes) ... while still preserving file-boundary transaction semantics. One of the "harder" parts of the implementation was preserving the standard CP on-disk record format (the "bitnet" protocol didn't actually transfer user files, it transferred encapsulated CP spool disk records, and a large part of the HSDT pilot traffic originated on and was targeted for standard systems).
There was also an attempt (in the "bitnet" protocol) to close a couple small failure-mode windows that could result in the same file being transferred more than once.
Another opportunity that presented itself for both the "bitnet" and "ip" protocols was the bandwidth*delay products over the high-speed satellite links. They were only on the order of 50-100 4kbyte blocks (not quite today's NII opportunity with coast-to-coast terrestrial 1gbit fiber hitting on the order of 800 4kbyte blocks). However, it was a formidable opportunity, especially with lots of bursty traffic. We had lots of the problems that have since been documented regarding adaptive slow-start windows in bursty environments with large bandwidth*delay products. Addressing the opportunity required implementing "bitnet" and "ip" adaptive rate-based pacing (and once rate-based pacing was established it then became possible to start looking at other types of rate-based algorithms like fair-share).
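The bandwidth*delay figures can be turned into bits and back with a couple of one-line functions. A Python sketch; the 26 ms delay in the example is an assumed value, picked only because it is roughly what makes a 1gbit link come out near 800 blocks, and 4096 bytes per block is likewise an assumption.

```python
def blocks_in_flight(bits_per_sec, delay_sec, block_bytes=4096):
    """Bandwidth*delay product expressed in blocks."""
    return bits_per_sec * delay_sec / (8 * block_bytes)


def bits_for_blocks(nblocks, block_bytes=4096):
    """Bandwidth*delay product, in bits, that keeps `nblocks` in flight."""
    return nblocks * block_bytes * 8


fiber_blocks = blocks_in_flight(1e9, 0.026)     # 1gbit, assumed 26 ms delay
satellite_low = bits_for_blocks(50)             # low end of the 50-100 block range
satellite_high = bits_for_blocks(100)           # high end
```

A window-based sender has to keep that many blocks outstanding to fill the pipe, which is why bursty traffic over such links pushed toward rate-based pacing.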
The design of the packet-boundary encryption mechanism was also interesting and may have caused some heart-burn.
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: alt.folklore.computers From: lynn@netcom11.netcom.com (Lynn Wheeler) Subject: Re: CP spooling & programming technology Date: Thu, 7 Apr 1994 15:53:25 GMT
With respect to some questions regarding HSDT, I was using a number of things ... but one set of hardware was NSC HYPERchannel adapters ... both for some of the long-haul/WAN interfaces but also for some local intra-cluster transport between local processors (I wish we had them available in the late '70s doing the cluster of eight MPs in the shared disk, single-system-image complex).
Slightly prior to HSDT (and slightly after the 8 2-way cluster work), I had gotten involved designing/implementing remote device support over HYPERchannel. The initial project was to "remote" some 300 "overflow" people from the IMS group to another site ... while still providing "local" access. For the most part this was local 3270 controllers ... but also some channel attached unit record gear.
At the time, the remote-device support from NSC mapped a single remote device subchannel address to a local A220 subchannel address ... downloading the local 370 channel command sequence to the A510 for execution. This represented too much inefficiency for me. One or two A220s easily had adequate performance to handle the job, I didn't need 5-6 of them just to provide a one-to-one subchannel mapping for the remote devices. Besides, the remote site was connected by a T1 microwave link (with two HYPERchannel A710s driving the interface).
Another limitation of their implementation was that the 3272/3274 controllers only handled a single operation at a time, while the A220s were high-performance burst controllers. With the one-for-one mapping design, the A220 would actually be identified as a 327x controller and the operating system would only schedule a single operation at a time ... while the "real" A220 could have simultaneous I/O scheduled on all subchannels.
In any case, I designed a brand-new implementation where I dynamically scheduled/allocated each A220 subchannel address on a per operation basis.
This uncovered a problem with the A710s. In most of the typical "remote" configurations with a single 3270 controller at a remote location and the original implementation ... there would never be more than one I/O operation in-flight at a time. It turns out that while the T1 link is full-duplex ... there was some portion of the A710 adapter that effectively operated half-duplex. It wasn't uncommon for me to be scheduling 10-15 simultaneous operations, resulting in a high probability that there would be requests for simultaneous data transfer in both directions. This caused me all manner of problems until I put in some dynamic adaptive restrictive code to really back-off on the number of simultaneous operations in-flight at any one moment (NSC eventually had to replace the 710s with 715s, there was also the full-duplex 720 satellite link adapters).
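The combination of per-operation subchannel allocation and an adaptive cap on operations in flight can be sketched as below. This is a Python toy with hypothetical names; in particular the back-off policy (halve the cap on trouble, raise it by one on success) is an assumption chosen for illustration, since the actual policy isn't described.

```python
class SubchannelPool:
    """Hands out local subchannel addresses per operation, with an adaptive cap."""

    def __init__(self, addresses, floor=1):
        self.free = list(addresses)
        self.busy = {}                     # operation id -> subchannel address
        self.floor = floor
        self.limit = len(self.free)        # current cap on operations in flight

    def start(self, op):
        """Return a subchannel for `op`, or None if the caller must queue it."""
        if len(self.busy) >= self.limit or not self.free:
            return None
        self.busy[op] = self.free.pop()
        return self.busy[op]

    def finish(self, op, ok=True):
        """Release the subchannel; halve the cap on trouble, raise it by one on success."""
        self.free.append(self.busy.pop(op))
        if ok:
            self.limit = min(self.limit + 1, len(self.free) + len(self.busy))
        else:
            self.limit = max(self.floor, self.limit // 2)
```

The point of the per-operation allocation is that a handful of local subchannels can serve many remote devices; the cap is what keeps a burst of simultaneous transfers from overrunning a link adapter that can't really go both ways at once.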
Eventually all the kinks were worked out and it turned out that the remote users didn't see any degradation in response on their local 3270s. There was however an interesting side-benefit, with overall system performance increasing by 10-15%. Prior to remote'ing the 3270s, the controllers were evenly spread across all the channels, sharing them with disk controllers. By high-performance standards, the 3270 controllers transferred data slower than disks (more channel busy time) and had relatively slower controller electronics (lots of slow channel hand-shaking and dramatically more channel busy time). Having the 3270 controllers on the same channels with the disks really impacted the channel availability time for disk activity.
Remote'ing the 3270 resulted in:
1) the 3270 hand-shaking busy time becoming the "problem" of the remote channel emulation of the A510 adapter (and was masked from the mainframe)
2) the actual data-transfer channel busy time happened at the A220 1.5mbyte burst rate
3) the A220 had very low hand-shaking busy overhead.
4) "compressing" all 3270 activity down into a single channel
Subsequently there were some performance advisories regarding not configuring 3270 controllers and disk controllers on the same channel.
The installation was duplicated in a number of locations. One provided remoting several hundred people in the Boulder field support group. The computer complex building and the new programmer's building were relatively close but on opposite sides of the interstate. Infrared T1 modems were used to connect the computer complex to the programmers. Originally, it was thought that the modems would be prone to high BER during rain and heavy fog. It turned out that the only time that BER got to be a problem was during a snow storm that was so heavy nobody could get into work. However, the infrared modems did have an alignment problem on sunny days. The modems were on poles on the top of the buildings. During the course of the day, the sun would unevenly heat different sides of the building ... which resulted in appreciable movement in the modems (at the top of the poles). The poles had to be relocated and the modems slightly defocused.
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: alt.folklore.computers From: lynn@netcom3.netcom.com (Lynn Wheeler) Subject: Re: CP spooling & programming technology Date: Thu, 7 Apr 1994 17:17:25 GMT
There was a "funny" glitch in my 80/81 re-implementation of the remote device support. For various types of A510 errors (basically A510 emulates a mainframe channel and allows direct attachment of mainframe channel controllers), I would map the error back into a simulated channel check for the device (logically the A510 error was the equivalent of a logical channel error).
Some 8-10 years later I got a call from somebody who monitors the industry quality reporting information (i.e. there is a company that gathers from lots of installations the mainframe error reporting information and produces reports by manufacturer and model). It turned out that the machine/model this person was associated with was showing up with an unexpected (alarming?) number of channel errors in the reports.
It turned out that the installations involved had NSC HYPERchannel remote device support and the channel check errors were really coming from the A51x remote device adapters and then the software driver was reflecting them back to the operating system as simulated channel check interrupts. Most of the errors weren't even really hardware. The mainframe "channel" protocol is basically synchronous. With the A51x operation there is an attempt to simulate this synchronous-bus behavior over a network using a basically asynchronous message protocol. Sporadically (even w/o A710s) a race condition would result and there is no choice but to abort and get the system to redrive the operation.
Turns out that switching to reporting/simulating IFCC (interface control check, instead of channel check) would logically kick-off the same operating system redrive/recovery actions ... and not have the (mostly communication/race-condition) errors show up as a black mark in the industry reliability summary reports.
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
From: lynn@netcom7.netcom.com (Lynn Wheeler) Newsgroups: alt.folklore.computers Subject: Re: CP spooling & programming technology Date: Thu, 7 Apr 1994 20:55:37 GMT
One of the biggest problems that the HSDT project had with the satellite links was getting permits from various local boards. In one case where a 4.5meter dish was going up about a half-mile from some residential housing, a number of residents showed up at a hearing to complain about the "radiation" dangers that it represented to them, their children and their pets.
Now the TDMA gear had a 25watt transmitter that nominally ran around 7watts (each station monitored its own rebroadcast signal strength and could automatically up the power-budget in situations like rain-fade ... this was Ku-band). Furthermore, it was a very focused, relatively tight-beamed transmission ... going straight up ... with nearly zero side radiation.
However, all that had no effect. Finally I had to do the calculations showing that if somebody was suspended directly above the dish in the focused transmission beam, they would receive less radiation than they were currently getting at home from the local 50,000 watt FM station.
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: comp.infosystems.interpedia From: lynn@netcom7.netcom.com (Lynn Wheeler) Subject: Misc. more on bidirectional links Date: Thu, 7 Apr 1994 20:46:50 GMT
One of the areas for bi-directional links is in some stuff with things like domain-specific ontologies.
One example is the NLM's UMLS (unified medical language system) meta-thesaurus. In attempts to address the opportunity of queries against really massive information bases, they've developed a constrained language classification system for much of the medlars stuff (in addition to the "online catalogue/abstracts" having all the words in the various fields indexed, the entries are also classified using the constrained knowledge/concepts).
The UMLS is available on CDROM and consists of (just the term statistics, not including the inter-term relationships, definitions, etc):
30,123 MeSH (16,760 preferred terms; 130,482 supplemental chemical terms)
23,495 INSERM French translation of MeSH (Main headings and French Synonyms)
12,495 SNOMED II (6,971 preferred terms)
21,293 ICD-9-CM terms (13,119 preferred terms)
 5,595 CRISP (4,285 preferred terms)
 5,094 LCSH (5,094 preferred terms)
 2,619 COSTART (1,179 preferred terms)
 1,511 COSTAR (1,511 preferred terms)
   905 NIC (336 preferred terms)
   776 AI Rheum (687 preferred terms)
   604 Neuronames (604 preferred terms)
   603 DXPlain (603 preferred terms)
   450 DSM 3R (263 preferred terms)
   557 CPT (210 preferred terms)
   100 NANDA (99 preferred terms)
   122 ACR (122 preferred terms)
   112 UMDNS (112 preferred terms)
This is about 500mbytes in "relational ascii" form.
In this usage, there is a "preferred" classification term that is used in classification/indexing entries. The "preferred" term points at all its synonyms, and each synonym points back at it, so it is possible to go both directions ... users can enter queries using non-preferred words/terms and they are automatically translated into the correct classification term for lookup. In addition, there are numerous bi-directional pointers between related classification terms (like human is a mammal, and mammal has instance human).
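The structure just described (synonyms pointing at a preferred term and back, plus paired relations between terms) can be sketched with a few dictionaries. This is a Python illustration with hypothetical names, not the UMLS schema; the "heart attack" synonym pair is only an example, not taken from the CDROM data.

```python
from collections import defaultdict


class Thesaurus:
    def __init__(self):
        self.preferred_of = {}                 # any term -> its preferred term
        self.synonyms_of = defaultdict(set)    # preferred term -> its synonyms
        self.related = defaultdict(set)        # term -> {(relation, other term)}

    def add_synonym(self, preferred, synonym):
        """Link a synonym and its preferred term in both directions."""
        self.preferred_of[preferred] = preferred
        self.preferred_of[synonym] = preferred
        self.synonyms_of[preferred].add(synonym)

    def relate(self, a, relation, b, inverse):
        """Record a relation and its inverse so it can be followed both ways."""
        self.related[a].add((relation, b))
        self.related[b].add((inverse, a))

    def lookup(self, term):
        """Translate a query term to the preferred classification term."""
        return self.preferred_of.get(term, term)


t = Thesaurus()
t.add_synonym("myocardial infarction", "heart attack")
t.relate("human", "is a", "mammal", "has instance")
```

A query entered as "heart attack" goes through `lookup` before hitting the index, and from "mammal" the `related` table leads back to "human" without a scan.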
This is a fairly large and complex structure when loaded into SNDBMS.
A much simpler example is the bottom-up IETF lexicon that I'm
slowly building. I've loaded most of the IETF RFC summary information
as well as the IETF standard process description (i.e. information
found in RFC1600). I then produce a number of reports ... one of them
is similar to the rfc-index (available via ftp/anon from
ftp.netcom.com/pub/ly/lynn ...
NOTE since moved to
https://www.garlic.com/~lynn/rfcietff.htm
... another is the report found in section 6.10 of RFC1600.
Another report is the rfc author index (also available via ftp/anon). Simple bidirectional links are the relationships between RFCs and authors. A slightly more complex example is being able to express the concept of people with different name forms. If you look at the rfc author index, there are a dozen or so cases where it appears that a single person has used at least two forms of their name. Given that this can be verified, a bi-directional relationship can be defined between the "preferred" name form and the alternative name form (for the same person). With that piece of information, I can slightly modify the query statement to "ignore" alternative name forms, while including the associated RFCs in the list for the "preferred" name.
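As a rough sketch of the name-form folding (in Python, with invented names and placeholder document numbers; the actual query language used against the SNDBMS is not shown in the posting):

```python
from collections import defaultdict

# Name form as printed -> document numbers. All values are placeholders.
rfcs_by_author = {
    "J. Doe": [9001, 9003],
    "Jane Doe": [9002],
    "R. Roe": [9004],
}

# Alternative name form -> "preferred" name form (verified by hand).
preferred_form = {"J. Doe": "Jane Doe"}

# The reverse direction of the same relationship.
alternative_forms = defaultdict(set)
for alt, pref in preferred_form.items():
    alternative_forms[pref].add(alt)


def author_index(rfcs_by_author, preferred_form):
    """Fold alternative name forms into the preferred entry."""
    index = defaultdict(list)
    for name, rfcs in rfcs_by_author.items():
        index[preferred_form.get(name, name)].extend(rfcs)
    return {name: sorted(rfcs) for name, rfcs in sorted(index.items())}


index = author_index(rfcs_by_author, preferred_form)
```

The alternative form no longer appears as its own index entry, but its documents show up under the preferred name.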
A more complex example is that I've loaded most of the "glossary" RFC definitions (like RFC1392) into the same IETF SNDBMS ... besides creating the relationship between the terms and the definitions ... I have also done the cross-relationships between things like "see also". I've also (pretty much manually) gone in and done the acronym<->term relationships.
The lisp code that I have for parsing the RFC header also does a check between words in the title and a subset of the terms/acronyms ... creating the appropriate relationships. With some more clean-up on the lexicon and the word-in-title relationship ... plus possibly some manual insertions between RFC and concept/term ... it would be possible to generate a concept index of the RFCs (in addition to an author index).
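The title check might look something like the following. This is a Python stand-in, not the lisp code mentioned above; the function name, the single-word matching rule and the sample titles are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def link_titles_to_terms(titles, terms, acronyms):
    """Relate documents to lexicon terms that occur in their titles.

    titles:   {document number: title string}
    terms:    set of lowercase single-word terms
    acronyms: {ACRONYM: expanded term}
    Returns both directions of the relationship.
    """
    rfcs_for_term = defaultdict(set)
    terms_for_rfc = defaultdict(set)
    for rfc, title in titles.items():
        for word in re.findall(r"[A-Za-z0-9-]+", title):
            term = acronyms.get(word.upper())
            if term is None and word.lower() in terms:
                term = word.lower()
            if term is not None:
                rfcs_for_term[term].add(rfc)
                terms_for_rfc[rfc].add(term)
    return rfcs_for_term, terms_for_rfc


# Placeholder data, not taken from the real rfc-index.
titles = {1: "SMTP extensions", 2: "A gateway for mail"}
terms = {"gateway", "mail"}
acronyms = {"SMTP": "simple mail transfer protocol"}
rfcs_for_term, terms_for_rfc = link_titles_to_terms(titles, terms, acronyms)
```

Inverting `rfcs_for_term` into a sorted listing would give the kind of concept index described above; multi-word terms would need phrase matching that this sketch leaves out.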
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: alt.folklore.computers From: lynn@netcom11.netcom.com (Lynn Wheeler) Subject: Re: 370 ECPS VM microcode assist Date: Fri, 8 Apr 1994 19:17:39 GMT
The 5-way mp project had started in Jan. of 75 ... and effectively I only had hardware engineers working on it with me. Since I wasn't planning on rewriting VM for MP support in one operation, I designed a staged migration.
As mentioned in prior posting, the CP kernel can be thot of as having two basic components:
1) virtual machine simulation
2) shared resource management (involving a number of subsystems)
The virtual machine simulation is effectively virtualmachine/process specific and somewhat free of MP considerations. With the engineering help, was able to "sink" the majority of the virtual machine simulation pathlengths into the microcode on each processor (sort of a super VMA).
That left the "2" shared resource management of the CP kernel. The traditional Q&D approach to MP'ing a kernel up until then was to install a spin-lock barrier. I was disinclined to take that approach.
What I did instead was to identify the minimal amount of the kernel function that I needed to MP in order to create a bounce lock on the main portion of the kernel ... i.e. if unable to obtain the lock, the request was queued (against the lock) and the processor went off to do something else. It turns out that the minimal amount of the kernel needed to create a "queued" kernel lock was basically the dispatcher and the first level interrupt handlers. To further optimize the machine operation, nearly all of the dispatcher was sunk into the hardware (effectively a number of threaded lists maintained by both the scheduler, the hardware dispatcher, and the "hardware" interrupt handler ... could add/delete/move entries on the lists following certain concurrency semantics). I also defined a higher level queued-list interface for offloading much of the page I/O interface into asynchronous hardware co-processors. At the same time, the complex was targeted for CMS intensive with my page-mapped file I/O changes (which effectively mapped into the page I/O interface which was now offloaded).
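The bounce-lock idea (queue the request against the lock and go do something else, rather than spin) can be sketched with Python threads standing in for processors. This is only an analogy under stated assumptions: the class and method names are invented, and the real implementation lived in the dispatcher, interrupt handlers and microcode, not in anything resembling this code.

```python
import threading
from collections import deque


class BounceLock:
    """Kernel-style lock that queues work instead of spinning (sketch)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._queue = deque()  # requests queued against the lock

    def submit(self, request):
        """Queue a request; return True if this caller ran the kernel work."""
        self._queue.append(request)
        ran_kernel = False
        # Re-check the queue after every release so a request queued while
        # the previous holder was finishing up is never left stranded.
        while self._queue and self._lock.acquire(blocking=False):
            ran_kernel = True
            try:
                while True:
                    try:
                        work = self._queue.popleft()
                    except IndexError:
                        break
                    work()
            finally:
                self._lock.release()
        return ran_kernel  # False: bounced, caller goes off to do other work


kernel = BounceLock()
log = []


def inner():
    log.append("inner")


def outer():
    log.append("outer-start")
    # The lock is already held here, so this request bounces and is queued.
    log.append("bounced" if not kernel.submit(inner) else "ran")
    log.append("outer-end")


kernel.submit(outer)
```

The holder finishes what it is doing and then immediately finds the queued request, which is the behavior the posting describes; a caller that bounces pays only for the enqueue.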
Compared to a CMS mixed-mode environment that at the time was running something like 60% problem state and 40% kernel/supervisor state ... this got the "supervisor" state down to on the order of 10%. The goal (for a five-way) with a kernel lock (even a kernel bounce lock) had to be 20% or less. With only a single processor executing in the kernel at a time ... the most kernel pathlengths that could execute was 100% of a single processor (regardless of whether those instructions were confined to a single processor or spread across multiple processors). At a 80/20 ratio, four processors in problem state would max. out a single processor's worth of kernel/supervisor state. A ratio of 75/25 would eventually reach a point where one processor (in a 5-way) was always waiting for work. At a 90/10 ratio, the implementation would suffice up thru a 10-way before work on kernel fine grain locks was required (&/or more asynchronous hardware assists).
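The arithmetic behind those ratios: if only one processor's worth of kernel time is available in total, and kernel time is a fixed fraction of all work, the number of processors that can be kept busy tops out at one over that fraction. A small Python check (function names are just for illustration):

```python
def max_useful_processors(kernel_pct):
    """Processors that one processor's worth of kernel time can keep busy."""
    return 100 / kernel_pct


def idle_processors(n_way, kernel_pct):
    """Processors in an n-way left waiting once the kernel lock saturates."""
    return max(0.0, n_way - max_useful_processors(kernel_pct))


limits = {pct: max_useful_processors(pct) for pct in (40, 25, 20, 10)}
```

This reproduces the figures in the text: 20% kernel time just covers a 5-way, 25% leaves one of five processors idle, and 10% is good up through a 10-way.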
By the time the Endicott lab. visited Cambridge in May of '75, there was a two-way prototype with some of the pieces in place.
There was an interesting time in the fall of '75 where some hdqtrs operation was doing product evaluations/comparisons. For this particular meeting, both the 5-way and the ECPS activity were on the agenda with supposedly the two projects being somewhat setup on opposite sides of the table. I don't think that the hdqtrs types realized when they set it up ... that they had me on both sides of the table simultaneously doing the technical arguments & pros/cons for both projects.
The five-way project was eventually canceled ... but some of the same participants revived it in another place with different hardware as a 16-way project (late fall of '75). When this got canceled, went back to just creating a prototype on a vanilla VM system using vanilla 2-way 370MP. The implementation was effectively the same as the 5-way, but without functions sunk into hardware. Effectively, some amount of the virtual machine simulation (i.e. per process), the dispatcher, the first-level interrupt handlers, etc. had to have some level of MP fine-grain lock support to bracket/support an effective bounce lock on the shared-resource management portion of the kernel.
The bounce lock actually represented a cache performance enhancement, at least for CMS mixed-mode environments. The majority of the time some CMS virtual machine operation hit the bounce lock ... it was into some logic that would result in blocked state (requiring the processor to do a task switch). If there was a processor already in the kernel, that processor would have some amount of the kernel pathlength already loaded into the cache. Another processor hitting the bounce lock would create a queue request for kernel services and go off and find some other virtual machine to run (the queue request for kernel services was extremely light weight being on the order of 20-30 instructions). The processor executing in the kernel would eventually complete what it was doing ... and then immediately find a queued request for more kernel services.
The bounce lock was actually the optimal MP implementation ... both in terms of number of lines of code changes as well as aggregate MP throughput (to say nothing of MP performance per code-lines of MP support) ... at least for CMS intensive mixed-mode environments.
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: alt.folklore.computers From: lynn@netcom5.netcom.com (Lynn Wheeler) Subject: Re: 370 ECPS VM microcode assist Date: Fri, 8 Apr 1994 20:46:37 GMT
note that the MP work mentioned in the prior posting didn't directly involve any of Charlie's fine-grain kernel locking work on CP/67 ... which also led up to the definition of the Compare And Swap instruction (note compare&swap wasn't chosen because it sounded good, it was chosen because it matched Charlie's initials; CAS).
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
From: lynn@netcom10.netcom.com (Lynn Wheeler) Subject: Re: CP spooling & programming technology Date: Sat, 9 Apr 1994 01:06:41 GMT Newsgroups: alt.folklore.computers
i hadn't really wanted to stay on the pascal compiler ... but i knew the person responsible for the pascal compiler and he was within a couple months of a c-language front-end. However, part way into the project I went on a 6-week lecturing tour of Europe and when I got back, I found he had left the company before the c frontend was done (and had gone to a compiler company). so i was stuck (for the time being) with pascal.
however, when it was finally decided to do a BSD port (actually two of them) ... I said "I bet I know where to go to get a compiler" ... so I eventually did get a c-compiler from him.
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: alt.folklore.computers From: lynn@netcom11.netcom.com (Lynn Wheeler) Subject: Re: CP spooling & programming technology Date: Sat, 9 Apr 1994 19:51:42 GMT
... objectives of migrating the spool function to a virtual address space and into a high(er) level programming language weren't just to make it more manageable and provide higher thruput to the using processes ... another objective was also to reduce the pathlength. The assembler implementation used a sequentially threaded list for spool files ... and typically there were several thousand such files. The rewrite used a hash table to look up a file by file-id and used red-black trees to maintain owner affinity relationships.
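A sketch of the two-index arrangement, in Python rather than the pascal of the rewrite. The hash table is a plain dict; the red-black trees are replaced here by a sorted list per owner (via `bisect`), which keeps the same ordered-by-owner view but is a swapped-in structure, not the one the posting names. Class and field names are hypothetical, and file-ids are assumed unique.

```python
import bisect
from collections import defaultdict


class SpoolIndex:
    """Spool file directory with lookup by file-id and by owner (sketch)."""

    def __init__(self):
        self.by_id = {}                    # file-id -> (owner, data)
        self.by_owner = defaultdict(list)  # owner -> sorted list of file-ids

    def add(self, file_id, owner, data):
        self.by_id[file_id] = (owner, data)
        bisect.insort(self.by_owner[owner], file_id)

    def lookup(self, file_id):
        # One hash probe instead of walking a list of thousands of files.
        return self.by_id.get(file_id)

    def files_of(self, owner):
        return list(self.by_owner.get(owner, []))

    def remove(self, file_id):
        owner, _ = self.by_id.pop(file_id)
        ids = self.by_owner[owner]
        ids.pop(bisect.bisect_left(ids, file_id))


spool = SpoolIndex()
spool.add(42, "userA", "print file")
spool.add(7, "userA", "punch file")
spool.add(13, "userB", "reader file")
```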
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: alt.folklore.computers From: lynn@netcom9.netcom.com (Lynn Wheeler) Subject: High Speed Data Transport (HSDT) Date: Sun, 10 Apr 1994 03:04:29 GMT
one of my interests in doing the HSDT project ... besides cluster interconnect (follow-on to the eight 2-way project from the late '70s & extending it to WAN environments) was infobahn type stuff.
One Friday night in the late '70s, Jim Gray and I were sitting around drinking beer (before hed use. We settled on online directories with a design point that somebody typing in a name would get an answer faster than if manually looking up the name in a paper directory sitting on the desk. Other project criteria were that it would take no more than two person weeks to design and implement the lookup program as well as the data collection/distribution mechanism and no more than 1 person day per month to maintain the directories.
It was decided to use flat files with a slightly modified name sort procedure. The lookup program used a radix search initially seeded with the frequency distribution for the first two letters in a name. The first probe would typically calculate a prorated miss-error (what was found versus what was predicted) for use in calculating the second probe. A name could typically be found within 4-5 4kbyte disk-block reads (2-3 data blocks and 1-2 indirect blocks). On a 370/158 (about 1mip processor) this was clocked at <100mills total cpu processing.
On a CMS system fixed-length records were used and the frequency distribution was expressed in record number. On Unix system, byte-displacements were used in the frequency distribution table.
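A simplified sketch of a lookup seeded by a first-two-letters table, using record numbers as in the CMS case. It is not the original program: the table here holds exact record ranges built from the file itself, and the probing inside a range is an ordinary binary search rather than the prorated miss-error correction described above. Names and sample data are invented.

```python
def build_prefix_table(names):
    """names must be sorted; returns {prefix: (first record, last record + 1)}."""
    table = {}
    for i, name in enumerate(names):
        key = name[:2]
        first, _ = table.get(key, (i, i))
        table[key] = (first, i + 1)
    return table


def lookup(names, table, target):
    """Return (record number or -1, number of probes into the file)."""
    lo, hi = table.get(target[:2], (0, 0))
    probes = 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if names[mid] == target:
            return mid, probes
        if names[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return -1, probes


# Stand-in for the sorted flat file, one name per fixed-length record.
names = ["adams", "baker", "banks", "barnes", "gray", "wheeler"]
table = build_prefix_table(names)
```

On a byte-stream file the table would hold byte displacements instead of record numbers, as noted for the Unix version.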
Between the directory distributions, software distributions, and various newsgroup "lists" ... my userid would periodically account for 30-40% of total network transmission on the internal network (during a period where it had somewhere around 1000-1500 mainframe nodes?? ... I don't seem to have my network-node timeline handy at the moment).
Many of these newsgroup "lists" in the 79-81 timeframe were having
Infobahn-type threads (that have been somewhat the rage on the
internet recently). There were jokes about newsgroups being
wheeler'ized ... where my postings might periodically account for half
or more of the traffic. I've since tried to become more moderate
(although there was a in-depth, 9month CMC/face-to-face/etc. study of
me during the mid-80s that resulted in a number of research reports
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: alt.folklore.computers From: lynn@netcom11.netcom.com (Lynn Wheeler) Subject: Re: painting computers Date: Sun, 10 Apr 1994 16:21:48 GMT
it wasn't authorized, but CSC painted the disk drives in its machine room (2nd floor 545 tech sq). There were five 8-drive 2314 strings and a short 5-drive string. Each string was painted a different color so that strings could be referred to by color for mounting/unmounting disk packs.
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: alt.folklore.computers From: lynn@netcom11.netcom.com (Lynn Wheeler) Subject: short CICS story Date: Sun, 10 Apr 1994 16:23:06 GMT
the university library had one of those navy operation research grants and had gotten selected to be the beta test site for the original CICS (it had been developed at some customer shop in the midwest, Chicago?) and it was in the process of being turned into a product. It turns out that it had been developed with applications using one flavor of BDAM (basic direct access method). The library had selected another flavor of BDAM for its application ... and it wasn't working. After picking around in the program storage dump for awhile, I discovered that the bits in the DCB weren't "correct". After patching some stuff, I got it past that hurdle.
Some of this is fuzzy after 25 years, but I vaguely remember a problem with shared BDAM also. The library wanted to do both CICS transactions against an online catalogue and also some batch jobs (had upgraded to MVT 15/16 ... and then MVT 18 during this period). There was something about CICS wanting to have all possible files it might access defined with DD cards at CICS startup ... and not gracefully making the transition w/o cycling CICS. I seem to remember trying to mess around with JFCB operations.
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: alt.folklore.computers,comp.society.folklore Subject: Re: High Speed Data Transport (HSDT) Date: 11 Apr 1994 03:53:59 GMT
In '85 as part of the HSDT effort I was working with one of the bell companies that had developed/deployed a pu5&pu4 emulation software. At the time, SNA didn't have any networking support ... so as a way of integrating SNA mainframes into a network environment, I was looking at picking up their pu5/pu4 emulation and installing it at network boundary nodes ... encapsulating SNA RUs in IP packets in order to achieve networking capability for SNA (especially over wan-clouds and backbones).
This wasn't too outlandish ... in a thread over in comp.unix.programming I mentioned that I had redone the CP/67 1050/2741 terminal support to add TTY (circa early '68) ... I had also rewritten the terminal identification sequence so that a common modem pool could be used for all incoming terminals. This all worked ... except that the IBM CE eventually informed me that while the IBM terminal controller had the capability of dynamically assigning which line-scanner was used on a port ... it had an unfortunate feature that the oscillator was hard-wired (i.e. bit-speed determination). This was in part responsible for four of us getting together to build the first non-IBM "OEM" controller for IBM mainframes (and discovering all sorts of things about ibm channels/memory ... like holding op-in for longer than the 13+mics timer update frequency ... locked out the processor from updating the timer value in real storage and brought down the machine ... also that ibm's tty line-scanner reversed the bits before they went to mainframe memory).
The attachment is the opening & closing foils for the description of their pu5/pu4 work ... vis-a-vis straight IBM offering (they had given presentation at '84 or '85 COMMON user group meeting).
Support included high-availability, no-single-point-of-failure, fall-over capability. Normally pu5/pu4 operation was effectively virtual circuit and if anything broke it took down the session. The pu5/pu4 emulation at boundary nodes would checkpoint the session information at the boundary node ... and had pair-processor fall-over in case of failure (no-single-point-of-failure). With RU encapsulation in IP-packets thru wan-clouds from boundary-node to boundary-node it was relatively insensitive to most failure modes.
Note these setups were typically multi-site with 60k-100k (or more) "keyboards".
As requested, this is cross-posted to comp.society.folklore (in
addition to alt.folklore.computers). For those just tuning in, at
least some of the postings can be found via ftp/anon at
ftp.netcom.com/pub/ly/lynn
(NOTE: which has since been moved to
https://www.garlic.com/~lynn/)
xxxxxxxxxxxx attachment xxxxxxxxxxxxxxxxxx
• Higher availability
• More reliable
• More function
• Improved Usability
• Non-IBM Host Support
• Much better connectivity
• Much better performance
• Fewer components
• Easier to tune
• Easier to tailor
• Easier to manage
• Less expensive
Newsgroups: comp.society.folklore,alt.folklore.computers Subject: Re: High Speed Data Transport (HSDT) Date: 12 Apr 1994 20:31:02 GMT
attached is short extract from a some-what long posting that i made in a competitive technology conference (on the well):
low-speed         <9.6kbits
medium-speed      19.2kbits
high-speed        56kbits
very high-speed   1.5mbits
On Monday morning on the wall of a conference room in Japan was:
low-speed         <20mbits
medium-speed      100mbits
high-speed        200-300mbits
very high-speed   >600mbits
Newsgroups: comp.arch.storage From: lynn@netcom9.netcom.com (Lynn Wheeler) Subject: Re: Failover and MAC addresses (was: Re: Dual-p Date: Fri, 15 Apr 1994 16:02:27 GMT
another flavor of network failover is multihomed support. The stuff in rfc1122/1123 is relatively abbreviated. there was an ietf draft by lakashman, 26apr88 ... but it fell off the draft disk (i could put it out in ftp.netcom.com/pub/ly/lynn if anyone is interested).
the lack of distinction regarding multiple interfaces on routers/gateways and hosts almost led to a crisis at interop '88 when the floor nets kept crashing most of sunday before the show ... giving rise to the requirement that hosts are to have ip-forwarding default to off.
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: alt.folklore.computers From: lynn@netcom7.netcom.com (Lynn Wheeler) Subject: mainframe CKD disks & PDS files (looong... warning) Date: Fri, 15 Apr 1994 18:52:53 GMT
The "mainline" standard mainframe operating system had a filesystem design that supported something called PDS (partitioned dataset) files ... basically a sublibrary of file "members" with a special directory at the front of the file (giving pointers to each member). One of the most common uses for PDS files was for program libraries (with each member being a different program). This was somewhat designed in the early '60s along with a disk feature called CKD (count-key-data) based on the memory/processor/transfer/disk capabilities & trade-offs of the early 60s.
CKD supported "search-key" operations out in the hardware disk controller. The PDS directory was organized as a structure using the file member names as "keys". While this design was somewhat optimal given the hardware trade-offs of the early '60s ... by the late '60s the relative performance of these different components had changed and the optimal trade-off design had also changed. The characteristic of a "multi-track" search operation involved specifying an I/O operation that pointed at a "key" in the processor real memory and a "cylinder" location to start the search. The search operation then would start scanning all keys (in the example ... a PDS directory area) sequentially, a track at a time ... until EITHER the specified condition was satisfied (i.e. equal, but things like hi/lo were also possible) OR the end-of-cylinder had been reached. While this search didn't involve any (much) processor cycles it exclusively "tied-up" the channel (nominally a shared resource) and the disk controller (also nominally a shared resource ... as well as the disk) for the duration of the search.
Come the late '70s ... and PDS/search-key were way past the optimal trade-off implementation design ... but they were still being used as the standard/default program library mechanism. At this time I got a customer "performance" problem call at a large retail customer.
The customer had large number of stores across the US and the operation was divided into 12 regions. They had semi-centralized the operation at a single location. At IS hdqtrs there were three 168/370s (about 2.5mip processors) running VM. Running under VM (on each machine) were four "SVS" virtual machines (for a total of 12 such virtual SVS machines across the complex). Each region had its own "dedicated" SVS operation for running its business (cics, transactions, batch jobs, etc).
The "problem" was unidentified extreme performance/thru-put degradation. When I arrived they had a room prepared with tables with mounds of performance/thruput print-outs. I got to sit-down and flip thru these mounds of paper with them telling me (subjective) information about when performance was good and when it was bad. There appeared to be little correlation between the numbers & good/bad thruput.
After a couple hours of this, a slight suspicion started to occur to me. Now these were relatively stock systems ... and (especially) the VM system had none of our instrumentation enhancements. I/O was basically indicated by count activity by device. There was nothing about mean service &/or queueing time per request. The only correlation pattern that I could eyeball was a specific disk would "top" out with an I/O request rate of around 7/second.
Now nominally these disks could be expected to show activity rates of 30-50/second ... with this rate occurring on several disks simultaneously all sharing the same disk controller and data channel/bus path to processor memory. After more detailed investigation, this particular disk was identified as the place where a very large program library resided that was shared & used by all SVS virtual machines. To be more specific, the PDS directory for the program library occupied three "cylinders".
Now this particular generation of disks spun at 3600 RPM and had 19 tracks per cylinder (about 12k/track). On average a program library look-up operation would scan half the directory (or about 1.5 cylinders). In effect a program load operation required a multi-track search of the first directory cylinder (19 revolutions) followed (on the average) by a multi-track search of 1/2 the next cylinder (9.5 revolutions, i.e. about 10). At 3600 RPM, a 19 revolution search was "busying" the (shared) device, the (shared) controller and the (shared) channel/bus path to memory for 317 milliseconds (little less than 1/3rd second). The search of the 2nd cylinder would then take (on the avg) 158 milliseconds. The two operations would take a combined elapsed time of 475 milliseconds to "find" a program ... followed by one or more "normal" I/O operations to load the program. Effectively two program loads could be done a second ... and when load/activity started picking up with all 12 SVS systems attempting multiple concurrent program loads ... everything would come to a screeching halt.
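The search times follow directly from the rotation rate, one revolution per track searched. A quick Python check of the numbers (variable names are just for illustration):

```python
RPM = 3600
TRACKS_PER_CYLINDER = 19

ms_per_rev = 60_000 / RPM  # one revolution scans one track

# Full multi-track search of the first directory cylinder.
first_cyl_ms = TRACKS_PER_CYLINDER * ms_per_rev

# On average, half of the next cylinder is searched before the hit.
second_cyl_ms = (TRACKS_PER_CYLINDER / 2) * ms_per_rev

search_ms = first_cyl_ms + second_cyl_ms

# Device, controller and channel are all busy for the whole search,
# so this bounds program loads per second across every sharing system.
loads_per_second = 1000 / search_ms
```

That is roughly 317 + 158 = 475 milliseconds per directory search, or about two program loads per second for the whole complex.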
Once the opportunity was identified, the program library was 1) partitioned (no PDS having directory more than 10 tracks) and 2) replicated.
This isn't the end of the tale. At an account closer to home, there was a dedicated MVS system on a 168/370 (2.5mip) machine and a dedicated CMS multi-user VM system on a 158/370 (1mip). This was organized as a single complex with all the disk controllers cross-connected ("dual-ported") between the two machines. However, the VM service strongly petitioned that actual operational deployment/use of (mountable) disk drives be carefully segregated ... so that strings of "CMS" disks were all on a common set of disk controllers that had NO mounted MVS(PDS) disks (because of the extreme performance impact of PDS multi-track search on disk controller thruput). The MVS "crowd" didn't pay much attention to the petition and/or the fact that multi-track search would affect interactive performance (they were from the 1-2 second interactive response school and we were from the <.15 interactive response school).
Anyway, one day a MVS operator mounted a MVS/pds disk on a drive managed by a CMS/VM disk controller ... the performance degradation was noticed immediately (by the CMS users). The MVS people still couldn't believe that one little 'ol MVS pack would cause so much havoc.
SO ....
we had a supercharged VS1, optimized for operation under VM ... which also "supported" PDS files. We started the supercharged VS1 system on the 158/VM machine with an application looping loading programs from a MVS disk drive (on a MVS disk drive controller). This VS1 system was 20+ times the performance of the MVS system (i.e. VS1 getting 1/20th of a 158 VM processor shared with interactive CMS users... against a MVS system running on a dedicated 168, having 2.5 times the raw CPU power). In any case, the interference generated by the VS1 system slowed down the MVS system to the point that the (168 MVS) application program(s) accessing the MVS/PDS disk (on the VM controller) was no longer interfering with CMS interactive performance. At that point, the MVS group at least agreed if we stopped VS1'ing their drives/controllers ... they would be more careful about MVS'ing "VM" controllers.
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
From: lynn@garlic.com (Lynn Wheeler) Newsgroups: comp.arch Subject: Re: SIE instruction (S/390) Date: 28 Sep 1994 15:54:55 GMT
SIE is a state change instruction.
way back in the dark ages (mid-60s), CP/40 & CP/67 relied on the standard 360 LPSW (load program status word) instruction to switch between the virtual machine simulator (CP) and virtual machine execution. Back then the machine state was relatively simple, it switched from:
• CP-space, real-address space
• "privilege-instruction" mode
• CP "instruction address"
to
• virtual machine, virtual-address space
• "problem-instruction" (i.e. user) mode
• virtual machine "instruction address"
in a single, standard (LPSW) instruction (and various interrupts were defined that performed the symmetrical, reverse operation).
Then along came virtual machine assists ... which were active only in "problem-state". One of the (otherwise) unused control regs had (real) pointer to assist structures. They performed things like 2nd level virtual address translation ... i.e. special page table management that effectively followed the real machine TLB rules. There was an "implied" state switch, that if CR6 was non-zero and the switch was from priv->prob ... then it also turned on various other assists.
As the machine states got more complex ... and more virtual machine emulation features were added to the hardware ... it became impossible to switch all the necessary machine states in a single, "standard" instruction.
This wasn't particularly a problem for the other IBM SCPs since they have a design point where part of the kernel executable instructions are common to all user-mode address spaces. For instance, it is possible to switch execution modes w/o having to switch address spaces or instruction address. However, for a virtual machine emulator, it isn't possible to have part of the emulator's instruction space reside inside the address space of the virtual machine it is emulating.
Since it isn't possible to execute the ->prob transition in a single instruction ... the symmetrical interrupt return to the emulator wasn't possible either ... as a result various state transitions within virtual machine mode now return to the instruction following SIE ... somewhat creating the facade that SIE is a single long running instruction ... although it is more like a hardware call stack.
--
Anne & Lynn Wheeler | lynn@garlic.com, lynn@netcom.com
From: lynn@garlic.com (Lynn Wheeler) Newsgroups: alt.folklore.computers Subject: Re: IBM 370/195 Date: 23 Oct 1994 15:55:42 GMT
... actually 64(?) instruction pipeline that supported concurrent & out-of-order execution (i.e. superscalar) ... and imprecise interrupts. It didn't have branch-prediction ... so a branch would drain the pipeline. For general things this was such a performance penalty that a dual I-stream implementation was investigated as a means of keeping the E-units busy (i.e. from programming standpoint looked like two-cpu SMP ... but effectively just had two instruction addresses, double the register array ... and everything had an extra bit tagging which i-stream it belonged to at any moment).
--
Anne & Lynn Wheeler | lynn@garlic.com, lynn@netcom.com
From: lynn@garlic.com (Lynn Wheeler) Newsgroups: alt.folklore.computers Subject: Re: IBM 370/195 Date: 24 Oct 1994 05:56:56 GMT
... oops, 195 did recognize branch targets within the pipeline, i.e. things like bct/bx back to head of loop ... holy grail for programmers doing optimized codes were <65 instruction loops ... which ran totally within the pipeline.
--
Anne & Lynn Wheeler | lynn@garlic.com, lynn@netcom.com
From: lynn@garlic.com (Lynn Wheeler) Newsgroups: comp.os.ms-windows.nt.misc Subject: process sleeping? Date: 13 Nov 1994 03:17:24 GMT
anybody seen a problem in 3.5 with process going to sleep?
last week I upgraded my 16mbyte 486dx/50 from 3.1 to 3.5. that seemed to go ok ... however loading the sdk cdrom seemed to have some sort of cancer ... loading gradually slowed down until it hit about 48% complete and then it was in real slow-motion ... eventually taking around 4 hrs elapsed to install the sdk cdrom. ddk didn't appear to be anywhere near as bad.
the next "interesting" problem was attempting to do makeall in sdk mstools .... at least three times during the makeall ... it would appear to go to sleep ... winperf showed no activity what-so-ever. 3.5 wasn't hung ... I could point & click on any other applications ... but the makeall was out cold. However, if i hit return in the makeall window, it would come back to life and continue (no messages on the screen indicating that anything out of the ordinary occurred ... just like it was in suspended animation until enter was hit). This eventually happened 3-4 times before makeall finally ran to completion.
has anybody else seen process suspended animation under 3.5??
--
Anne & Lynn Wheeler | lynn@garlic.com, lynn@netcom.com
From: lynn@garlic.com (Anne&Lynn Wheeler) Newsgroups: alt.folklore.computers Subject: Re: baddest workstation Date: 16 Nov 1994 03:13:43 GMT
try comp.sys.super ... they have regular posting ... old sample:
- (05-MAY-1993) ?,Engineering,Government,Biloxi,Mississippi,US
  1) 4 * Cray C916-1024 111.64?
- (04-MAY-1993) National Security Agency,Fort Meade,Maryland,US
  1) TMC CM-5/1024 59.7
  2) 5 * Cray Y-MP8/8-256 34.75?
- (26-APR-1993) Los Alamos National Labs,Los Alamos,New Mexico,US,lacomputing@lanl.gov
  1) TMC CM-5/1024 59.7
  2) 2 * Cray Y-MP8/8-128 13.9 ?
  4) 2 * Cray Y-MP8/4-128 6.95?
  6) Cray X-MP/4-16 2.45?
  7) 2 * TMC CM-200/64K ~ 2.2
  9) 16 * IBM RS/6000-560 .4 ?
- (14-APR-1993) Minnesota Supercomputing Center,Minnesota,US,consult@msc.edu
  1) TMC CM-5/544VU ~31.72
  2) Cray C98-256 13.96?
  3) Cray 2/4-512 6.15?
  4) Cray X-MP/4-64 2.45?
  5) Cray X-MP4/EA64 2.45?
  6) Cray M92/2-1024 1.16?
  7) TMC CM-200/32K ~ .55
--
Newsgroups: comp.arch From: lynn@netcom4.netcom.com (Lynn Wheeler) Subject: Re: bloat Date: Fri, 18 Nov 1994 19:27:59 GMT
xt370 was originally going to ship with 384k real memory ... but various tests showed significant page thrashing for most activity. It finally shipped with 512k real memory ... and one of my more esoteric page replacement algorithms (from the early 70s) and a page mapped filesystem (that I also had done in the early 70s ... provided certain optimizations ... especially for things like loading 1mbyte binary executable). The 512k real memory needed to accommodate the cp fixed storage requirements as well as the pageable cms system services ... in addition to any application code you wanted to execute (i.e. imagine running NT in a 512kbyte real memory).
--
Anne & Lynn Wheeler | lynn@netcom.com lynn@garlic.com
From: lynn@garlic.com (Anne & Lynn Wheeler) Newsgroups: comp.arch Subject: Re: Bloat, elegance, simplicity and other irrelevant concepts Date: 27 Nov 1994 19:28:50 GMT
... i've posted this before ... but
23-24 years ago, we had 360/67 (.5mips; no cache, & I/O would cause lots of interference on memory bus ... i.e. more like .3-.4mips with i/o) with 768k running cp/cms (105 4k pageable pages after fixed memory requirements) supporting 70-80 users with subsecond response for trivial transactions .... ran word-processing (script ... precursor to gml/sgml/html), edit, program development (assembler, fortran, pli), APL applications (modeling).
... comparing it to system 12 years later (similar interactive profile/response) from report in early '80s:
system            3.1L       HPO        change
machine           360/67     3081       47* (mips)
pageable pages    105        7000       66*
users             80         320        4*
channels          6          24         4*
drums             12meg      72meg      6*
page I/O          150        600        4*
user I/O          100        300        3*
# disk arms       45         32         4*?perform
bytes/arm         29meg      630meg     23*
avg. arm access   60mill     16mill     3.7*
transfer rate     .3meg      3meg       10*
total data        1.2gig     20.1gig    18*
... i.e. over period of 12+ years, number of interactive CMS users increased in proportion to I/O thruput NOT cpu capacity. Also typically during the period, E/B profiling switched from MIPS/MBYTES to MIPS/mbits.
Environment did change from line-mode terminals to channel-attached display screens. Around '81, I did a high-performance implementation of NSC HYPERchannel remote-device support ... remoting all the display controllers for a number of systems. There was an unanticipated side-effect which resulted in >15% system thruput improvement. Turned out that the i/o bus busy time for HYPERchannel was significantly lower than same exact activity with direct-attached display controllers (although in some quarters the reaction was similar to the terminal controller that I worked on 15 years earlier that became the origins of the ibm-compatible controller industry).
The reduction in I/O bus busy time resulted in corresponding thruput improvement for disk i/o.
... in any case, I recently ran into what appeared to be page-thrashing on single user NT3.5 system running on 16mbyte 486DX50 attempting to install latest SDK from CDROM. It got to 48% complete in around 30minutes ... and then appeared to go into page-thrashing mode, taking around 4hrs total to complete.
--
Anne & Lynn Wheeler | lynn@garlic.com, lynn@netcom.com
From: lynn@garlic.com (Anne & Lynn Wheeler) Newsgroups: comp.arch Subject: Re: bloat Date: 29 Nov 1994 17:08:30 GMT
I ran across this description at the time of the government anti-trust suit (in the early '70s) ... I never ran across any validation/repeat, so I don't know if it is real:
1) the "criteria" was based on the cost of software/application programming plus the additional expense of operating without the application (if the application can save $x million per month, then every month the application is unavailable costs the company $x million more than it otherwise needed to spend). These direct and indirect costs/expenses associated with software/applications far outweighed the hardware costs. The state-of-the-art in the late 50s and early 60s was such that the only way that a growing company could grow its software applications (as fast) was to upgrade to a faster machine that was exactly instruction compatible with the machines the applications were currently running on.
2) given that only one company actually met the SINGLE MOST IMPORTANT criteria and none of the other companies did, then in theory that one company could make every subsequent decision absolutely wrong and still retain a competitive advantage (since even in aggregate, none of the other competitive factors would outweigh that single advantage).
... there were misc. other issues related to how human organizations operate given various aspects of success/failure feedback control loops (I was specializing in dynamic adaptive feedback control mechanisms at the time).
Since that time, the dominating factors in the model have changed at least twice. For some time, commercial/industrial strength features that could have only come about because of a large customer base were significant. With the change in the software-state-of-the-art and the reduction in price/performance ... the shift has been towards open (aka commodity-priced) systems. With the per-system profit margin being squeezed ... the magnitude of "large customer base" has changed (i.e. sufficient market segment to fund development of very expensive new features). There has also been a shift in the percentage of the market that is rapidly expanding vis-a-vis percentage that is downsizing.
All of the factors still play but the relative magnitudes of the effects are in flux ... with market shifts spanning multiple technology generations. However the duration of technology generation is not only decreasing ... but the rate of decrease seems to be accelerating. In theory, various human factors should cause some friction preventing cycle dropping below some threshold ... although I saw a recent note that some of the consumer electronics manufacturers are now on a 90 day model cycle.
--
Anne & Lynn Wheeler | lynn@garlic.com, lynn@netcom.com
From: lynn@garlic.com (Anne & Lynn Wheeler) Newsgroups: alt.fan.bill-gates,comp.os.ms-windows.nt.misc,comp.os.ms-windows.advocacy Subject: Re: SMP, Spin Locks and Serialized Access Date: 04 Dec 1994 07:32:45 GMT
sun, vms, irix? ... late 60s there were a number of smp kernel implementations with single kernel spin lock. early '70/'71 I worked with Charlie on smp ... which started out as fine granularity spin-locks ... but rapidly evolved into his definition of CAS (compare&swap, sometimes shortened to CS; actually it started out as his initials ... and then contrived to come up with something meaningful for his initials). CAS allowed some number of atomic operations to be performed w/o requiring spin-lock bracketing ... and defined the operations for smp operation as well as enabled critical sections whether smp or up.
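A minimal sketch of the compare&swap update pattern described above: read the old value, compute the new one, and retry if some other thread changed the word in between. Python has no user-level compare&swap instruction, so the primitive here is simulated with an internal lock; `Cell`, `atomic_add`, and `run_demo` are hypothetical names for illustration, not anything from the 370 instruction set or the original kernel code.

```python
import threading


class Cell:
    """A word of storage with a simulated compare-and-swap primitive."""

    def __init__(self, value=0):
        self.value = value
        # Assumption: this lock only stands in for the atomicity that the
        # hardware instruction would provide; callers never hold it.
        self._guard = threading.Lock()

    def compare_and_swap(self, expected, new):
        """Store `new` only if the cell still holds `expected`; report success."""
        with self._guard:
            if self.value == expected:
                self.value = new
                return True
            return False


def atomic_add(cell, delta):
    """Update without lock bracketing: re-read and retry until the swap succeeds."""
    while True:
        old = cell.value
        if cell.compare_and_swap(old, old + delta):
            return old + delta


def run_demo(threads=4, increments=1000):
    """Several threads bump one shared counter using only atomic_add."""
    counter = Cell(0)

    def worker():
        for _ in range(increments):
            atomic_add(counter, 1)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter.value
```

The caller never brackets the update with a lock of its own; a failed swap just means another thread got there first, and the loop recomputes from the fresh value.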
a couple years later (20 years ago?) we did some specific scenarios with smp bounce locks ... which queue rather than spin (with sufficient l/w threads, it is less expensive than spins ... and provided improved cache hit benefits).
--
Anne & Lynn Wheeler | lynn@garlic.com, lynn@netcom.com
From: lynn@garlic.com (Anne & Lynn Wheeler) Newsgroups: comp.arch Subject: Re: Rethinking Virtual Memory Date: 14 Dec 1994 05:46:52 GMT
cp/40 was implemented on a custom modified 360/40 that had 256kbytes of real memory ... 64 4kbyte pages .... custom relocation hardware had 64 v-to-r translation entries ... one for each real page.
cp/40 was ported to standard product hardware ... 360/67 and was called cp/67. 67 had an eight entry associative array & did LRU replacement on the entries ... for v-to-r translation. the official system product was tss/360 ... including page mapped filesystem (32bit virtual at the time was more than all disks available). it suffered from problem that working set size tended to approach file size (frequently much larger than real memory sizes of the time) ... when reading files and using standard LRU replacement algorithm. it needed much more complex management with something like LRU/MRU switch for different regions of virtual memory.
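The LRU problem with page-mapped sequential file reads can be seen in a small simulation. This is a sketch under stated assumptions: a file of 10 pages read front to back several times with only 8 real page frames, and a pure MRU policy standing in for the "LRU/MRU switch" idea. `count_faults` and `sequential_passes` are hypothetical helper names.

```python
from collections import OrderedDict


def count_faults(refs, frames, policy):
    """Count page faults for a reference string under 'LRU' or 'MRU' replacement."""
    resident = OrderedDict()  # least recently used first, most recently used last
    faults = 0
    for page in refs:
        if page in resident:
            resident.move_to_end(page)  # a hit makes the page most recent
            continue
        faults += 1
        if len(resident) >= frames:
            # LRU evicts from the front of the order, MRU from the back.
            resident.popitem(last=(policy == "MRU"))
        resident[page] = True
    return faults


def sequential_passes(file_pages, passes):
    """Reference string for reading a file front to back, `passes` times."""
    return [page for _ in range(passes) for page in range(file_pages)]


refs = sequential_passes(file_pages=10, passes=5)
lru_faults = count_faults(refs, frames=8, policy="LRU")
mru_faults = count_faults(refs, frames=8, policy="MRU")
```

With the file larger than real memory, LRU always throws away exactly the page that the next pass needs soonest, so every reference faults; MRU keeps most of the file resident across passes.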
i did pagemapped variation in early 70s on cp/67 (which never shipped as product except in mid-80s on pc/370 co-processor cards ... xt, at, and later ps/2). Included several "windowing" variations allowing simulation of high-performance multiple buffering techniques in common use by typical production programs of the period (didn't have the simplified paradigm ... but the working set sizes tended to be closer to program size ... than file size; while offering various optimizing techniques).
the transition from 360/67 to 370 went from associative array to set-associative table-lookaside-buffer. 370/168 had 128 entries each with 3-bit "address space ownership" ... and associated 7 entry "STO" (address space id) stack (LRU managed). switching to a different address space/task ... would check the STO-stack ... if it is not found in the STO-stack; would use LRU to replace entry ... and purge all the TLB entries "owned" by the replaced STO-entry.
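A toy model of the STO-stack behavior just described, to make the purge rule concrete. It is a sketch, not the 168 hardware: the 128-entry capacity and the set-associative lookup are not modeled, and `TaggedTLB` with its method names is hypothetical. Only the 7-slot, LRU-managed STO stack and the purge of entries "owned" by a replaced slot follow the description above.

```python
from collections import OrderedDict


class TaggedTLB:
    """Toy TLB whose entries are owned by slots in a small LRU-managed STO stack."""

    def __init__(self, sto_slots=7):
        self.sto_slots = sto_slots
        self.sto_stack = OrderedDict()  # STO -> slot number, least recent first
        self.entries = {}               # (slot, virtual page) -> real page
        self.current_slot = None

    def switch_address_space(self, sto):
        """Make `sto` the current address space; return how many entries were purged."""
        purged = 0
        if sto in self.sto_stack:
            self.sto_stack.move_to_end(sto)  # found: its TLB entries stay valid
        else:
            if len(self.sto_stack) < self.sto_slots:
                slot = len(self.sto_stack)   # a free slot is still available
            else:
                # Replace the least recently used STO and purge what it owned.
                _, slot = self.sto_stack.popitem(last=False)
                victims = [key for key in self.entries if key[0] == slot]
                for key in victims:
                    del self.entries[key]
                purged = len(victims)
            self.sto_stack[sto] = slot
        self.current_slot = self.sto_stack[sto]
        return purged

    def load(self, vpage, rpage):
        """Record a translation owned by the current address space."""
        self.entries[(self.current_slot, vpage)] = rpage

    def lookup(self, vpage):
        """Return the real page for `vpage` in the current space, or None on a miss."""
        return self.entries.get((self.current_slot, vpage))
```

Switching back to an address space that is still in the stack costs nothing; only a switch that forces an STO replacement pays for a purge.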
--
Anne & Lynn Wheeler | lynn@garlic.com, lynn@netcom.com
From: lynn@garlic.com (Anne & Lynn Wheeler) Newsgroups: comp.arch Subject: Re: Rethinking Virtual Memory Date: 15 Dec 1994 19:08:19 GMT
software "virtual memory" and no exceptions are somewhat orthogonal.
There was at least one scenario of hacking OS/MVT (13?) in the late '60s to run in relocate mode on the 360/67 .... it wasn't done to optimize the multiprogramming level (i.e. only keeping the working set around in real memory implies that larger number of tasks can be operated simultaneously) ... but to solve storage allocation fragmentation problem ... and so ran w/o exceptions (i.e. total allocated virtual memory = real memory). MVT multitasked in a single (real) address space. Each application expected contiguously allocated space. Running multiple (long running, on the order of multiple hours) 2250/display graphic applications resulted in all sorts of bad fragmentation effects on dynamic storage allocation. Virtual memory was used to "fix" storage fragmentation problems associated with contiguously allocated storage. In any case, it operated w/o exceptions.
another aspect of "software" virtual memory was the original 801 (i.e. power/pc) design point from the mid-70s. Hardware and software design point had proprietary operating system and inverted pagetable with software TLB load (i.e. TLB miss resulted in exception to the software ... which then had to figure out the virtual=real mapping and load the TLB). The design point called for nominal avg. of approximately 20 instructions to perform the exception & tlb load on the fly.
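A rough sketch of the software TLB load idea: the TLB holds a few translations, and a miss raises an "exception" to software, which consults the inverted page table and loads the entry. Everything here is an illustrative assumption (class names, the hash index, the TLB size of 4, oldest-loaded eviction); it says nothing about the actual 801 table layout or its roughly 20 instruction handler.

```python
class InvertedPageTable:
    """One entry per real frame, plus a hash index from (space, vpage) to frame."""

    def __init__(self, frames):
        self.owner = [None] * frames  # frame -> (space id, virtual page)
        self.index = {}               # (space id, virtual page) -> frame

    def map(self, space, vpage, frame):
        """Give `frame` to (space, vpage), dropping whatever it held before."""
        old = self.owner[frame]
        if old is not None:
            del self.index[old]
        self.owner[frame] = (space, vpage)
        self.index[(space, vpage)] = frame

    def find(self, space, vpage):
        return self.index.get((space, vpage))


class SoftwareLoadedTLB:
    """TLB with no hardware table walk; misses are handled by software."""

    def __init__(self, table, size=4):
        self.table = table
        self.size = size
        self.entries = {}             # (space, vpage) -> frame, oldest loaded first
        self.miss_exceptions = 0

    def translate(self, space, vpage):
        key = (space, vpage)
        if key not in self.entries:
            self._miss_exception(key)  # stands in for the trap to software
        return self.entries[key]

    def _miss_exception(self, key):
        """Software handler: look up the mapping and load it into the TLB."""
        self.miss_exceptions += 1
        frame = self.table.find(*key)
        if frame is None:
            raise LookupError("page fault: no real frame holds %r" % (key,))
        if len(self.entries) >= self.size:
            oldest = next(iter(self.entries))  # dicts keep insertion order
            del self.entries[oldest]
        self.entries[key] = frame
```

The point of the design is visible in `translate`: hardware only ever looks at the small `entries` table, and the page-table format is entirely a software matter.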
Another aspect of memmap'ing is the number of "things" that can exist in the address space. I believe the s/38 ... and then the as/400 has single 48bit(?) global address space for mapping everything (globally) ... which says something about the aggregate size of everything.
In a segmented address architecture ... the number of things can be limited by the maximum number of segments supported (orthogonal to the aggregate size). Typical segment implementation has an address space table divided into N segment pointers. Segments tend to also be a unit of virtual object sharing (i.e. multiple different address space tables have a segment pointer to the same segment table). Frequently the segment pointer may have access bits (i.e. allowing some address spaces r/w to an object ... while others only have r/o access to the same object). Segment'ed architectures range from as few as 16 "things" (in a 32-bit address space) to thousands.
In my previous post, i mentioned doing a file mem-mapped implementation in the early '70s ... it also supported a segment sharing implementation. At that time, the issue wasn't so much the number of bits for mapping, but the total number of different things that could be mapped. Two methodologies were commonly used at the time: globally-unique/consistent mapping (even with multiple address spaces) and non-globally-unique. I chose non-globally-unique ... which typically entails a little bit more work. However, it avoids the scenario (especially for shared objects) ... that every object must only have a single virtual address. The non-globally unique approach runs into a brick-wall when a specific application/address-space needs more simultaneous objects than is supported by the segment architecture. The globally unique implementation however starts running into a brick wall when the total number of possible things needed by all applications starts to exceed the number of segments supported (even tho each individual application only requires a very small number).
Final aside, in a multi-space TLB ... the address space ownership tends to be associated/identified via the (real?)addresses of each (virtual)address-space-table. This becomes a more interesting problem in an inverted-page-table architecture ... since there are no hardware defined address-space tables on which to base ownership. The typical solution is for the architecture to define some sort of arbitrary bit-string identifier that the operating system then uniquely loads based on which application/task is executing. TLB entries then are associated/identified with the bit-string ID ... rather than the address of the address-space-table.
In one situation this has been somewhat used for marketing hype. In that scenario, an inverted-page-table architecture claimed that the actual number of virtual address bits was not 32-bits ... but was the addition of the number of bits in the virtual address (32-bits?) and the number of bits in the address-space bit-string ID (12, 24, etc).
--
Anne & Lynn Wheeler | lynn@garlic.com, lynn@netcom.com
From: lynn@garlic.com (Anne & Lynn Wheeler) Newsgroups: comp.arch Subject: Re: Rethinking Virtual Memory Date: 16 Dec 1994 14:56:58 GMT
there was another "kind" of software virtual memory. cp/67 implemented virtual machines ... and even allowed a virtual copy of itself to run in one of the virtual machines. In order to accomplish this task, it effectively had to simulate numerous operations on the real machine. For simulation of virtual-virtual memory ... it basically implemented a software associative-array/TLB, referred to as "shadow tables".
When a virtual machine executed a control instruction that switched virtual address space ... the shadow-tables for that virtual machine were cleared/reset (analogous to resetting all the TLB entries) and processing actually occurred using the shadow table pointer ... not pointing to the tables in the virtual machine memory (i.e. just like a processor doesn't "run" out of the address tables ... but really runs out of the TLB). The first thing that happens is a page fault (since all the entries in the shadow table have been cleared). The virtual machine simulator then examines the page fault and the tables in the virtual machine memory ... and loads the appropriate entry in the shadow table with the appropriate value (analogous to the hardware loading a TLB entry).
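The shadow-table cycle above (clear on address-space switch, fault, consult the virtual machine's own tables, load the shadow entry) can be sketched as follows. The class and its fields are hypothetical, page tables are flattened to dictionaries, and protection, segment tables, and invalidation on guest table changes are left out.

```python
class ShadowTableSimulator:
    """Toy virtual machine simulator that runs a guest out of shadow tables."""

    def __init__(self, guest_to_host):
        self.guest_to_host = guest_to_host  # guest "real" page -> host real page
        self.guest_tables = {}              # space name -> {virtual page: guest real page}
        self.shadow = {}                    # virtual page -> host real page
        self.active = None
        self.shadow_faults = 0

    def define_space(self, name, table):
        """Tables the guest builds in its own (virtual machine) memory."""
        self.guest_tables[name] = dict(table)

    def switch_space(self, name):
        """Guest control instruction: reset the shadow, like resetting a TLB."""
        self.active = name
        self.shadow.clear()

    def access(self, vpage):
        """Translate through the shadow table, filling it on a fault."""
        if vpage not in self.shadow:
            self.shadow_faults += 1
            # Examine the guest's tables, then translate guest-real to host-real.
            guest_real = self.guest_tables[self.active][vpage]
            self.shadow[vpage] = self.guest_to_host[guest_real]
        return self.shadow[vpage]
```

Because the shadow maps guest virtual pages directly to host real pages, the same structure stacks: a simulator running inside a virtual machine keeps its own shadow for its guests, and the outer simulator shadows that in turn.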
Since this is recursive ... it was actually possible to run a virtual cp/67 in the virtual machine of a virtual cp/67 in a virtual machine of a real cp/67 on a real 360/67. While this seems a little extreme, the methodology was actually used for a real live project. More than a year before any 370 machines had been built ... a version of CP/67 was developed that provided virtual 370 relocation architecture (rather than 360/67 relocation architecture). The instructions for 67 & 370 relocation were different and some of the table formats were somewhat different. This CP' would simulate 370 relocate virtual machines by interpreting the tables in the virtual machine address space according to the 370 architecture rules ... but would translate into shadow tables built using 67 architecture rules.
Now since this was more than a year before the hardware had been built, there was no operating system to test it out on ... so another set of modifications to CP/67 were made (let's call this CP'') which ran on 370 relocate machine and produced virtual 370 machines. Both CP' and CP'' interpreted tables in their virtual machine memory according to the 370 architecture rules ... but CP' built 360/67 tables and issued 360/67 control instructions; while CP'' built 370 tables and issued 370 control instructions.
Running a year before any 370 hardware had been built was a configuration consisting of:
  real 360/67
  cp/67 running on real 360/67 providing 360/67 virtual machines
  cp' running 67 virtual machine providing 370 virtual machines
  cp'' running in 370 virtual machine
cp/67, cp', and cp'' all were providing software virtual TLB emulation to its virtual machines.
--
Anne & Lynn Wheeler | lynn@garlic.com, lynn@netcom.com
From: lynn@garlic.com (Anne & Lynn Wheeler) Newsgroups: comp.arch Subject: Re: Rethinking Virtual Memory Date: 16 Dec 1994 18:07:53 GMT
... ok, tales out of school. In the late '60s I did global LRU (effectively what some of the unixes started calling clock in the '80s ... R.Carr's virtual memory management thesis, 1981 and various papers by him, Hennessy, et al.). In early '70s I also did a variation which not only would approximate true LRU ... but actually beat true LRU ... especially in the pathological scenarios where you really didn't want LRU but MRU ... and at my same short pathlengths (previous postings here this year as well as recent postings in alt.folklore.computers).
in any case, there was this performance modeling group that proved that there was better performance by modifying clock-like replacement algorithms to bias for non-changed pages (i.e. replace non-changed pages before replacing changed pages ... since it eliminated the write operation). In any case, the MVT->SVS group became convinced that the optimization should be used (despite protestations to the contrary). Things were chugging along nicely until a couple releases into the MVS product cycle ... when it occurred to somebody that the "shared R/O linklib" (i.e. highly used program library code executed in common by all address spaces) was being paged out before private application data regions. The downside was that not only were "high-use" program pages being selected for replacement prior to lower-use data pages .... but "missing" high-use shared library pages tended to result in page faults across large percentage of all address spaces ... whereas ... the lower-use private data pages would have only resulted in single-address space page fault/blockage.
===================
multi-levels of cache ... don't forget ibm mainframe extended-storage. for one of the systems around 74 or so ... i did a page migration implementation extending LRU-like replacement algorithms to memory storage hierarchies ... effectively memory, fixed-head disk (one head per cylinder) and moveable-head disk ... then a little later extended it to hierarchy of fixed-head disk areas (center of movable arm disk, on either side of center, different disk pools, etc). mainframe expanded storage is variation on electronic disk ... except special high-speed internal bus with high-speed "synchronous" transfer between expanded storage and regular processor memory. Synchronous eliminated the thousands & thousands of instructions in the kernel I/O pathlength scheduling and asynchronous operation handling (not like the number I had for earlier systems ... see reference I posted today in alt.folklore.computers).
--
Anne & Lynn Wheeler | lynn@garlic.com, lynn@netcom.com
From: lynn@garlic.com (Anne & Lynn Wheeler) Newsgroups: comp.arch Subject: Re: Rethinking Virtual Memory Date: 18 Dec 1994 19:03:25 GMT
... live estimates ... it wasn't as critical when technology generations were 7-8 years ... but as generations start to decline to 1-3 years ... architecture adaptability/flexibility/fluidity easily begin to dominate the equation (not how well it works in any particular environment) ... sorry ... this is my boyd soap-box (see my "Re: bloat" posting here 29Nov1994 ... or my archived postings from alt.folklore.computers ... ftp.netcom.com/pub/ly/lynn).
for example ... somebody has a chart showing processor performance technology far exceeding I/O interconnect performance ... until the early '90s where they cross ... when I/O interconnect bandwidth becomes almost unlimited and processing power becomes bottleneck (affecting numerous application paradigms that tried to optimize transportation of bytes) ... i.e. from my early '70 days and dynamic adaptive "scheduling to the bottleneck"
--
Anne & Lynn Wheeler | lynn@garlic.com, lynn@netcom.com
From: lynn@garlic.com (Anne & Lynn Wheeler) Newsgroups: comp.arch Subject: Re: Rethinking Virtual Memory Date: 18 Dec 1994 19:37:36 GMT
pagefault traps; ... almost every "high-performance" subsystem that does its own threading. logical equivalent in unix are applications that want to use asynch. I/O (the equivalent of which the mainframe apps have had all along since the beginning and was frequently the standard paradigm). As applications get larger and their working sets become weaker ... there is a higher & higher probability that they encounter blocking because of page faults (at the extreme would be all file activity done via memmap). At that point some metaphor similar to asynch. I/O is required ... but for virtual memory.
we introduced pagefault traps as part of the 370 138/148 which were heavily oriented towards VM environment (ref: my comments on 138/148 microcode in postings to alt.folklore.computers archived at ftp.netcom.com/pub/ly/lynn). (NOTE: since moved to: https://www.garlic.com/~lynn/) Use was by the 138/148 scp products (dos/vs & vs1) ... which would typically run what they thought were 1-to-1 virtual to real mapping (but was really virtual machine).
One of the reasons was that cp paging tended to have up to 10:1 performance advantage over dos&vs1 paging (pathlength, etc) ... except on the downside that the virtual machine couldn't task-switch across the page-fault (i.e. it was possible to run VS1 faster under cp than w/o cp). Another reason was that if the operating environment had to be mixed-mode under cp ... LRU replacement algorithms aren't recursive (LRU tends to exhibit MRU behavior to next level LRU).
--
Anne & Lynn Wheeler | lynn@garlic.com, lynn@netcom.com
From: lynn@garlic.com (Anne & Lynn Wheeler) Newsgroups: comp.arch Subject: Re: Measuring Virtual Memory Date: 21 Dec 1994 16:46:34 GMT
I think I've described this before ... but what we did for dynamic feedback resource control algorithms (including default cycling mode with fair share support) in the early '70s was extensive instrumentation and data gathering on several production systems (five minute snapshots of overall system and individual process activity for several years across possibly 100 mainframes?).
Lots of that information was reduced to provide workload profiles and input to capacity planning models (i.e. given various workload profiles and specific resources ... predict what specific changes to resources would mean).
Using the workload profile information we also developed synthetic workloads that were calibrated to exhibit real life workload characteristics. Three methodologies were then used ... 1) automated workload operation (automated cycle involved rebuilding kernel, rebooting, running benchmark, terminating benchmark, repeat cycle), 2) specific set of different benchmark workloads that represented nominal points covering various aspects of performance envelopes, 3) automated "hill-climbing" methodology that would look at results of previous benchmarks and attempt to select next benchmarking points.
For the first release of the Resource Manager PRPQ (in the mid 70s), I performed a series of 2000 benchmarks over a period of 3 months. This included validating fair-share operation over a wide range of loads and conditions; validating that priority changes for non fair share operation resulted in predictable changes in resource; validating that under extreme stress loads the system degraded in a predictable, "graceful" manner (i.e. pick nominal "heavy" load for a specific workload profile .... and then increase the workload by a factor of 10 times, or increase the paging loading by 10 times, or increase the arrival rate for file I/O by 20 times, etc).
Being able to profile/characterize the workload and also have a resource manager that operated in a predictable manner under a wide range of machines and workloads helped in being able to do capacity planning and throughput predictions. It also somewhat simplified the capacity planning analytical model that was also used to predict effects of addition or reduction of resources (more memory, more disk arms, faster cpu, more cpus, etc).
--
Anne & Lynn Wheeler | lynn@garlic.com, lynn@netcom.com
From: lynn@garlic.com (Anne & Lynn Wheeler) Newsgroups: alt.folklore.computers Subject: Re: How Do the Old Mainframes Date: 23 Dec 1994 17:16:48 GMT
I took intro to fortran spring of my sophomore year ... it was tube 709 with a 1401 used for unit-record front-end. student jobs (card decks) were batched in trays and read by the 1401 on 2540 card reader and written to 200bpi 7track tape. operator carried the tape over to the 709 tape drive ... where 709 ibsys ran the student fortran jobs tape-to-tape. Operator carried the output tape from the 709 back to the 1401 tape drive which read the output tape and printed to 1403 printer (or in some cases punch cards to the 2540 card punch). 709 was gigantic thing ... 32k 36bit words (tubes). There was something about requiring 20 (40?) ton air conditioner to handle all the heat.
That summer they got a 360/30 (64k 8bit bytes) and I got my first programming job re-implementing the 1401 frontend MPIO (unit-record <-> tape) program (even tho the 30 had a 1401 hardware emulation mode). I had to design my own supervisor, device drivers, interrupt handlers, i/o subsystems, memory allocation, etc. After a month or so, I had a 2000 card/statement 360 assembler program that took 30minutes to assemble under the 360 operating system ... but took over the whole machine when it ran (used most of memory for elastic buffers and could run card->tape and tape->printer/punch at the same time). They also wanted a version that ran under the operating system ... which required conditional assembly for either my operating system or 360 system services macros. 360 system services macros were a bear; one of them, a DCB, was required for each device; it took the assembler 6 minutes elapsed time to process each DCB; it was so bad that it was even recognizable in the light patterns on the front panel. Just the five DCBs added another 30 minutes to the elapsed assemble time.
At the start of the fall semester, they went back to regular schedule; but they would let me have the machine room on weekends ... typically they would finish up processing around 8am on saturday ... and then I would pull a straight 48hr shift until 8am on Monday ... before going off to a shower and classes.
About six months later, the 709 was replaced with a 360/67 (initial 512k 8bit bytes ... but upgraded to 768k). It was used for business apps and student jobs ... but had off-shift testing of IBM's tss/360 (interactive, virtual memory, etc). In Jan of '68, the university was selected as test site for CP/67 (csc had converted it from cp/40 and was running at cambridge and also out at lincoln labs).
At that time tss/360 could just barely handle 4 concurrent interactive users. The objective with cp/67 was to run the 360 batch operating system in a virtual machine along with as many interactive CMS users as possible.
--
Anne & Lynn Wheeler | lynn@garlic.com, lynn@netcom.com
From: lynn@garlic.com (Anne & Lynn Wheeler) Newsgroups: alt.folklore.computers Subject: Re: How Do the Old Mainframes Date: 24 Dec 1994 17:39:33 GMT
a very significant point of cp/40-cp/67-vm/370 genre was that it was possibly one of the first microkernels. it is frequently overlooked, but the simplicity of creating a very high brick-wall between kernel/resource algorithms and other parts of the system went a long way towards managing complexity. I believe that there have been some number of theses on the subject over the years after it started shipping as a product. The original paper/proposal was written by Bob Creasy at CSC ... I believe sometime late in '65. A more detailed history is Melinda Varian's paper "what your mother never told ...", available via ftp/anon somewhere at Princeton.
while the cp group concentrated on virtual machine support ... the CMS group concentrated on turning out a user interface. It eliminated a lot of confusion thru the years about what function got put where, what the interfaces were, and cut down on serious mistakes (that frequently happened elsewhere) involving implementing the wrong function in the wrong place.
On the other hand ... while I (personally) pretty much did CP things on the CP side of the line and CMS stuff on the CMS side of the line during most of '68 .... starting in '69 I started also "enhancing" (violating) the architecture as defined in the official 360 Principles of Operation hardware manual (i.e. first a short-cut for I/O translation, a heavy pathlength item for the kernel).
During the early years I was frequently concerned with pathlength; first cutting it as far as possible and then doing detailed analysis regarding most common status ... and then implementing specialized paths for the most common cases (i.e. fastpath). Between January of '68 when I was first introduced to CP/67 until approximately Sept of '68 I had reduced CP pathlengths (for a specific workload) by an order of magnitude.
Previously I had developed some optimization methodology for the standard batch OS product that cut the elapsed processing time for our typical workload in half (mostly by careful placement on disk of various system components/libraries). In the beginning the elapsed time for baseline workload would take 2.5 times as long to run in a virtual machine (as it did stand alone; a frequent observation of mine is "how many other operating systems were constantly faced with my application takes X% longer than if the kernel didn't exist", and if they had been, would they have focused more attention on efficiency). Between Jan. '68 and Sept. of '68, I had reduced that from 2.5 to 1.15.
Effectively the standalone (normal, non-virtual machine) component was composed of I/O wait, problem "state" instructions, and privilege "state" instructions (I had already doubled the standalone baseline thruput by a huge reduction in the I/O wait time). The virtual machine operation introduced additional overhead composed of software simulation for all privilege instructions. The virtual-machine-operation increase in elapsed time was pure compute bound software simulation. For a 10 minute baseline, CP originally would add 15 minutes software simulation (i.e. 25 minutes elapsed instead of 10); I initially reduced that to 1.5 minutes (11.5 minutes instead of 10).
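The overhead arithmetic above, spelled out: the elapsed ratio is standalone time plus CP simulation time, divided by standalone time. `elapsed_ratio` is just an illustrative helper; the minute figures are the ones quoted in the paragraph.

```python
def elapsed_ratio(baseline_min, cp_overhead_min):
    """Virtual-machine elapsed time relative to running standalone."""
    return (baseline_min + cp_overhead_min) / baseline_min


before = elapsed_ratio(10, 15)    # 25 minutes elapsed instead of 10
after = elapsed_ratio(10, 1.5)    # 11.5 minutes elapsed instead of 10
reduction = 15 / 1.5              # cut in CP simulation time for this workload
```

The 2.5 and 1.15 ratios fall out directly, and the drop from 15 to 1.5 minutes of simulation is the order-of-magnitude pathlength reduction mentioned earlier.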
That somewhat addressed single virtual machine batch component, although "fastpath" has a danger that the "most frequent" distribution may change over time (those 90 percentile operations that had special case fastpath, may drop to only 10 percent or less).
Multi-user, mixed-mode, (batch and interactive) ... introduced other issues. In addition to raw pathlengths, there was all sorts of stuff associated with linear lists and frequent scanning. I created (clock) global LRU page replacement, not only because it was significantly better than what was there (which quickly degenerated to FIFO) but also because it eliminated scanning all pages each time (non-scalable), frequently 100:1 pathlength reduction.
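For reference, a minimal sketch of a generic one-hand clock replacement loop of the kind referred to above. This is the textbook form, not the CP/67 code: `ClockReplacer` and its fields are hypothetical, and there is no dirty-page handling or per-process accounting.

```python
class ClockReplacer:
    """One-hand clock: sweep frames, clearing reference bits, until one is found clear."""

    def __init__(self, frames):
        self.pages = [None] * frames        # which page each frame holds
        self.referenced = [False] * frames  # software copy of the reference bit
        self.where = {}                     # page -> frame
        self.hand = 0
        self.faults = 0

    def touch(self, page):
        """Reference `page`; return True if this caused a page fault."""
        frame = self.where.get(page)
        if frame is not None:
            self.referenced[frame] = True   # a hit only sets the bit, no list scan
            return False
        self.faults += 1
        # Advance the hand, giving referenced pages a second chance.
        while self.pages[self.hand] is not None and self.referenced[self.hand]:
            self.referenced[self.hand] = False
            self.hand = (self.hand + 1) % len(self.pages)
        victim = self.pages[self.hand]
        if victim is not None:
            del self.where[victim]
        self.pages[self.hand] = page
        self.referenced[self.hand] = True
        self.where[page] = self.hand
        self.hand = (self.hand + 1) % len(self.pages)
        return True
```

The hand resumes where the previous sweep stopped, so a replacement usually inspects only a few frames instead of scanning every page each time, which is where the pathlength saving comes from.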
I also created an algorithmic dynamically adaptable resource allocation, which increased the complexity of the dispatching priority calculations (allowing various combinations of fair-share, non-fair-share, somewhat dynamically adaptable to the bottlenecking resource), but significantly reduced overall pathlengths and eliminated another couple linear list scans (eliminated the periodic process priority adjust that afflicts most current unixes ... remember both cp and unix trace common heritage to ctss). The scheduler had some amount of complex stuff that would adapt the multiprogramming level based on efficiency of the paging systems. Increases in the multiprogramming level would increase page contention and in the worst case lead to page thrashing. The dynamics went beyond simple working-set to observe if across a large range of configurations there were different paging devices & different levels of contention for those paging devices ... then based on efficiency/thruput of the page I/O subsystem ... the overall system could tolerate varying degrees of page contention ... walking thin-line regarding optimal multiprogramming level (for maximal thruput) with maximum tolerated page contention. This was somewhat a perversion of "working-set" which basically would calculate the one, true working-set size value.
I've made several recent postings on this subject related to virtual
memory in comp.arch ... also this thread is somewhat a repeat of the
one here last March ... I archived my postings at
ftp.netcom.com/pub/ly/lynn.
--
Anne & Lynn Wheeler | lynn@garlic.com, lynn@netcom.com
From: lynn@garlic.com (Anne & Lynn Wheeler) Newsgroups: alt.folklore.computers Subject: Re: How Do the Old Mainframes Compare to Today's Micros? Date: 25 Dec 1994 00:25:30 GMT
... originally posted to "bloat" thread in comp.arch ... 27nov94
.. i've posted this before ... but
23-24 years ago, we had 360/67 (.5mips; no cache, & I/O would cause lots of interference on memory bus ... i.e. more like .3-.4mips with i/o) with 768k running cp/cms (105 4k pageable pages after fixed memory requirements) supporting 70-80 users with subsecond response for trivial transactions .... ran word-processing (script ... precursor to gml/sgml/html), edit, program development (assembler, fortran, pli), APL applications (modeling).
... comparing it to system 12 years later (similar interactive profile/response) from report in early '80s:
system            3.1L      HPO       change
machine           360/67    3081      47* (mips)
pageable pages    105       7000      66*
users             80        320       4*
channels          6         24        4*
drums             12meg     72meg     6*
page I/O          150       600       4*
user I/O          100       300       3*
# disk arms       45        32        4*?perform
bytes/arm         29meg     630meg    23*
avg. arm access   60mill    16mill    3.7*
transfer rate     .3meg     3meg      10*
total data        1.2gig    20.1gig   18*

... i.e. over period of 12+ years, number of interactive CMS users increased in proportion to I/O thruput NOT cpu capacity. Also typically during the period, E/B profiling switched from MIPS/MBYTES to MIPS/mbits.
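The arithmetic behind that observation can be checked directly from a few rows of the table. The snippet below is just a worked calculation; the dictionary layout and names are mine, the numbers are the ones in the table, and the 47x cpu figure is taken as stated there rather than computed.

```python
# (3.1L value, HPO value) pairs copied from the comparison table
rows = {
    "pageable pages": (105, 7000),
    "users":          (80, 320),
    "channels":       (6, 24),
    "drums (meg)":    (12, 72),
    "page I/O":       (150, 600),
    "user I/O":       (100, 300),
}
CPU_GROWTH = 47  # mips change as stated in the table, not derived here

# growth factor for each row: new value divided by old value
ratios = {name: new / old for name, (old, new) in rows.items()}

for name, ratio in ratios.items():
    print(f"{name:15s} {ratio:6.1f}x")
print(f"{'cpu (stated)':15s} {CPU_GROWTH:6.1f}x")
```

Users, channels and page I/O all come out at 4x and user I/O at 3x, against 47x for the processor and roughly 66x for pageable pages.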
Environment did change from line-mode terminals to channel-attached display screens. Around '81, I did a high-performance implementation of NSC HYPERchannel remote-device support ... remoting all the display controllers for a number of systems. There was an unanticipated side-effect which resulted in >15% system thruput improvement. Turned out that the i/o bus busy time for HYPERchannel was significantly lower than the same exact activity with direct-attached display controllers (although in some quarters the reaction was similar to that for the terminal controller that I worked on 15 years earlier that became the origins of the ibm-compatible controller industry).
The reduction in I/O bus busy time resulted in corresponding thruput improvement for disk i/o.
... in any case, I recently ran into what appeared to be page-thrashing on a single user NT3.5 system running on a 16mbyte 486DX50 attempting to install the latest SDK from CDROM. It got to 48% complete in around 30 minutes ... and then appeared to go into page-thrashing mode, taking around 4hrs total to complete.
--
Anne & Lynn Wheeler | lynn@garlic.com, lynn@netcom.com