
Wednesday, March 27, 2024

AI-squared ... a testing paradigm


AI-squared. What's that?
Is this something Project Managers need to know about?
Actually, yes. PMs need to know that entirely new test protocols are coming, protocols that challenge some of the system-test paradigms at the heart of PM best practice.

AI-squared
That's using an AI device (program, app, etc.) to validate another AI device, sometimes a different version of itself! Like GPT-2 validating -- or supervising, which is a term of art -- GPT-4. (Is that even feasible? Read on.)

As reported by Matteo Wong, all the AI firms, including OpenAI, Microsoft, Google, and others, are working on some version of "recursive self-improvement" (Sam Altman's term), or, as OpenAI researchers put it, the "alignment" problem, which includes the "supervision" problem, to use some of the industry jargon. 

From a project development viewpoint, these techniques are close to what we traditionally think of as verification that results comport with the prompt, and validation that results are accurate. 

But in the vernacular of model V&V, and particularly of AI "models" like GPT-X, the words are 'alignment' and 'supervision':
  • Alignment is the idea of not inventing new physics when asked for a solution. Whatever the model's answer to a prompt, it has to "align" with the known facts, or the departure has to be justified. One wonders if Einstein (relativity) and Planck (quantum theory) were properly "aligned" in their day. 

  • Supervision is the act of conducting V&V on model results. The question arises: who is "smarter", the supervisor or the supervised? In the AI world, this is not trivial. In the traditional PM world, a lot of deference is paid to the 'grey beards', the very senior tech staff, as the font of trustworthy knowledge. This may be about to change. (A minimal sketch of the supervision idea follows below.)
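
To make the supervision idea concrete, here is a minimal sketch of one AI checking another: a 'supervisor' model grades a 'worker' model's answer against a prompt and a list of known facts. This is only an illustration of the pattern, not anyone's published method; ask_worker and ask_supervisor are hypothetical placeholders for real model calls, and the pass/fail logic is my own.

```python
# Minimal sketch of AI-squared supervision: one model checks another.
# ask_worker() and ask_supervisor() are hypothetical stand-ins for real
# model calls; no vendor API is implied.

def ask_worker(prompt: str) -> str:
    """Hypothetical call to the model being supervised (the 'worker')."""
    raise NotImplementedError("wire up the worker model here")

def ask_supervisor(question: str) -> str:
    """Hypothetical call to the supervising model (possibly a weaker one)."""
    raise NotImplementedError("wire up the supervisor model here")

def supervise(prompt: str, known_facts: list) -> dict:
    """Verification: get the worker's answer to the prompt.
    Validation/alignment: ask the supervisor whether the answer
    contradicts anything we already hold to be true."""
    answer = ask_worker(prompt)

    contradictions = 0
    for fact in known_facts:
        question = ("Answer YES or NO. Does the following text contradict "
                    f"this fact?\nFact: {fact}\nText: {answer}")
        if ask_supervisor(question).strip().upper().startswith("YES"):
            contradictions += 1   # a departure that needs justification

    return {"answer": answer,
            "contradictions": contradictions,
            "aligned": contradictions == 0}
```

The obvious weakness, taken up below, is mutual reinforcement: if the supervisor shares the worker's blind spots, a clean "aligned" verdict proves very little.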
And now: "Unlearning"!
After spending all that project money on training and testing, you are now told to have your project model "unlearn" stuff. Why?

Let's say you have an AI engine for kitchen recipes, apple pie, etc. What other recipes might it know about? Ones with fertilizer and diesel? Those are to be "unlearned".

One technique along this line is to have true professional experts in the domains to be forgotten ask nuanced questions (not training questions) to ascertain latent knowledge. If latent knowledge is discovered, the model is 'taught to forget' it. Does this technique work? Some say yes.
 
What to think of this?

Obviously, my first thought was "mutual reinforcement" or positive feedback ... you don't want the checker reinforcing the errors of the checked. Independence of testers from developers has been a pillar of best-practice project process for as long as anyone can remember.

OpenAI has a partial answer to my thoughts in this interesting research paper.

But there is the other issue: so-called "weak supervision", described by the OpenAI researchers. Human developers and checkers are categorized as "weak" supervisors of what AI devices can produce. 

Weakness arises from limits of time, from overwhelming complexity, and from enormous scope that is economically out of reach for human validation. And humans are susceptible to biases and judgments that machines would not be. This has been the bane of project testing all along: humans are just not consistent or objective in every test situation, nor necessarily from day to day.

Corollary: AI can be, or should be, a "strong supervisor" of other AI. Only more research will tell the tale on that one.

My second thought was: "Why do this (AI checking AI)? Why take a chance on reinforcement?" 
The answer comes back: stronger supervision is imperative. It promises better timeliness, broader scope, and more consistent testing than human checking can deliver, even with algorithmic support for the human. 

And of course, AI testing takes the labor cost out of the checking process for the device. And reduced labor cost could translate into fewer jobs for AI developers and checkers.

Is there enough data?
And now it's reported that most of the low-hanging data sources have already been exploited for AI training. 
Will it still be possible to verify and validate ever more complex models the way it was possible (to some degree) to validate what we have so far?

Unintelligible intelligence
Question: Is AI-squared enough, or does the exponent go higher as "supervision" requirements grow because more exotic and even less-understood AI capabilities come onto the scene?
  • Will artificial intelligence be intelligible? 
  • Will the so-called intelligence of machine devices be so advanced that even weak supervision -- by humans -- is not up to the task? 




Wednesday, July 20, 2022

Submarines in the Idaho desert


Submarines in Idaho?
In the desert?

Actually, a hull, a reactor, and a steam turbine.
All part of the 1953 simulation of the first-ever naval submarine nuclear propulsion reactor to run in an actual ship configuration, all the way to turning a propeller shaft.

That is what we PMs call a full-scale model and simulation test.
And not only was this first-ever in a ship's configuration, but the reactor-steam turbine combo ran for 96 continuous hours, a near-miracle for the technology of the day. 
"Radical technologies require conservative engineering"
Admiral Rickover, the father of naval nuclear power 

And what significance was 96 hours?
That was the time needed -- according to estimates -- for the soon-to-be Nautilus to transit continuously submerged from North America to Europe.(*)

We know the rest of the story: this "submarine in Idaho" project simulation and testing led to a successful Nautilus, followed by widespread nuclear naval power within a few decades.




(*) Did you know: in World War II, the extended range for a continuously submerged transit was only about 20 miles?




Tuesday, October 2, 2018

Is your tail heavy?



'Is your tail heavy?' is the question raised in a recent post at 'Critical Uncertainties'.
It might be if you are a risk with some "memory" of the immediate past.

Risk with memory? What does that mean?
  • The immediate past influences the immediate future
  • The probability of the arrival of an outcome is not time-stationary: as time passes, the probabilities change
  • The distribution of the arrival time of an outcome is "heavy tailed", meaning that (usually) the more time that passes without the outcome, the less likely it is to happen at all  
In the posting (above), an example is the expected arrival of an email: near term, it's expected. But if it doesn't get here soon, it may not get here at all.

Project consequences:

  • Simple assumptions, like symmetrical bell curves, are unlikely to give a good picture of when a risky outcome may happen
  • Testing for an unlikely outcome may be easier and more economical than you might think: run a few tests; if it doesn't fail soon (infant mortality), it likely won't for a long while (see the sketch after this list). 
  • Early on, consumer electronics exhibited such behavior. (If you could make it a few days, you were likely to make it a few years) 
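
Here is the sketch promised above: a minimal simulation, with made-up parameters, contrasting a memoryless (exponential) time-to-failure with a heavy-tailed one. The conditional chance of failing "soon" stays flat in the exponential case but shrinks as time passes for the heavy tail, which is the whole point.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Two hypothetical time-to-failure populations (parameters are illustrative only):
memoryless = rng.exponential(scale=10.0, size=n)   # light tail, no memory
heavy_tail = rng.pareto(a=1.5, size=n) * 10.0      # heavy tail

def p_fail_soon(samples, t, dt=5.0):
    """P(fail within the next dt | survived to time t)."""
    survivors = samples[samples > t]
    return np.mean(survivors <= t + dt)

for t in (0, 10, 20, 40):
    print(f"t={t:>2}:  exponential {p_fail_soon(memoryless, t):.2f}   "
          f"heavy-tailed {p_fail_soon(heavy_tail, t):.2f}")

# The exponential column stays (roughly) constant: the past doesn't matter.
# The heavy-tailed column keeps shrinking: the longer the outcome hasn't
# arrived, the less likely it is to arrive soon -- "if it hasn't happened,
# it probably won't."
```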
Who knew 
Who knew heavy tails were the cheap way out of expensive testing??!




Saturday, November 11, 2017

Tools for testing



Tools linked to process, and process linked to tools, are always grist for debate in the agile space. Defect tracking is one of those processes that begs the question: to engage with a tool or not? Many say:
  • Put the defect on a card, 
  • Put the card on a wall in the war room, and
  • Work it into the backlog as a form of requirement that is commonly labeled "technical debt".

For sure, that's one way to handle it. Of course, the cards are perishable. Many say: so what? Once fixed, there's no need to clutter things up with a bunch of resolved fixes.

Lisa Crispin and Janet Gregory, writing in their book "Agile Testing: A Practical Guide for Testers and Agile Teams", have a few other ideas. From their experience, there are these reasons to use an automated tool to capture and retain trouble reports:

  • Convenience: "If you are relying only on cards, you also need conversation. But with conversation, details get lost, and sometimes a tester forgets exactly what was done—especially if the bug was found a few days prior to the programmer tackling the issue."

  • Knowledge base: Probably the only reason to keep a knowledge base is for the intermittent problems that may take a long time and a lot of context to work out. The tracker can keep notes about observations and test conditions. (A minimal sketch of such a record follows this list.)

  • Large or distributed teams: It's all about accurate communications. A large or distributed team can not use a physical war room that's in one place

  • Customer support: If a customer reports the defect, they're going to expect to be able to get accurate status. Hard to do with a card hanging on the wall if the customer isn't physically present. 

  • Metrics: Agile depends on benchmarks to keep team velocity current and up to date.

  • Traceability: It's always nice to know if a particular test case led to a lot of defects. Obviously, many defects will not come from a specific test; they'll be found by users. But it never hurts to know.
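
Here is the record sketch mentioned above: a hypothetical, minimal trouble-report structure showing the kind of detail that falls off an index card but survives in a tracker. The field names are my own illustration, not Crispin and Gregory's.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class DefectRecord:
    """Hypothetical minimal trouble report -- illustrative fields only."""
    defect_id: str
    summary: str
    test_conditions: str               # environment, data, build under test
    linked_test_case: Optional[str]    # traceability back to the test that found it
    reported_by: str
    status: str = "open"               # open / in progress / resolved
    notes: list = field(default_factory=list)

    def add_note(self, text: str) -> None:
        """Knowledge base: keep observations on intermittent problems."""
        self.notes.append((datetime.now(), text))

bug = DefectRecord("D-101", "Checkout total wrong for split payments",
                   test_conditions="build 4.2.1, test data set B",
                   linked_test_case="TC-87", reported_by="tester A")
bug.add_note("Only reproduces when the second payment method is a gift card.")
```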

Of course, there are a few reasons to be wary of a database-driven DTS tool. Number one on my list is probably one that makes everyone's list:

  • Communications: It's not a good idea to communicate via notes in the DTS. Communication belongs first face-to-face, and then in email or text if a record is needed. The DTS is for logging, not a substitute for getting up and talking to the counter-party.

  • Lean: All tools have a bit of overhead that does not contribute directly to output. Thus, maximum lean may mean minimum tooling.

Bottom line: Use the tool! I've always had good results using tracking tools. It just takes a bit of discipline to make them almost transparent day-to-day.




Tuesday, July 19, 2016

Testing... how much testing?


Consider this quote from Tony Hoare:
One can construct convincing proofs quite readily of the ultimate futility of exhaustive testing of a program and even of testing by sampling. So how can one proceed?

The role of testing, in theory, is to establish the base propositions of an inductive proof.

You should convince yourself, or other people, as firmly as possible, that if the program works a certain number of times on specified data, then it will always work on any data.
This can be done by an inductive approach to the proof.

Everyone remember what an inductive argument is? Yes? Good!
Then I don't need to say that ....
When reasoning inductively you are reasoning from a specific instance to a generalization. The validity of the generalization can only be known probabilistically. (It's probable -- with high confidence -- that any instance will work because a particular instance -- or particular set of instances -- worked.)

Thus, there is no absolute certainty that a generalization will work just because a large set of specifics works.

Monte Carlo testing is inductive. There are two inductive conditions in Monte Carlo testing:
  1. That the test conditions are time-insensitive ... thus, tests done serially in time are valid for any time, and would be valid if the tests had been done in parallel rather than serially
  2. That the test results for a finite number of iterations are representative of the next iteration that is not in the test window
Sampling is also inductive.
And why would one do this?
Economics mostly.
Usually you can't afford to test every instance possible; sometimes, you can't create all the conditions to test every instance possible. The plain fact is, as Mr Hoare says, you should be able to do "just enough" testing to convince yourself and others who observe objectively that, hey: it will work! 
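
A minimal sketch of "just enough" testing by sampling: throw random inputs at a property the program must satisfy, then make the inductive claim explicit with the rule of three (zero failures in N random trials bounds the failure rate at roughly 3/N with about 95% confidence). The function under test here is just a stand-in.

```python
import random
from collections import Counter

def function_under_test(xs):
    """Stand-in for the real program; here, just a sort."""
    return sorted(xs)

def property_holds(xs):
    """The property we generalize from: output is ordered and has the same items."""
    out = function_under_test(xs)
    ordered = all(a <= b for a, b in zip(out, out[1:]))
    return ordered and Counter(out) == Counter(xs)

def monte_carlo_test(trials=3000, seed=42):
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        xs = [rng.randint(-1000, 1000) for _ in range(rng.randint(0, 50))]
        if not property_holds(xs):
            failures += 1
    if failures == 0:
        # Rule of three: an inductive confidence statement, not a proof.
        print(f"0 failures in {trials} trials; failure rate likely below {3/trials:.4f}")
    else:
        print(f"{failures} failures in {trials} trials")

monte_carlo_test()
```

Note that the two inductive conditions from the list above are baked in: each trial is assumed valid regardless of when it runs, and the sampled inputs are assumed representative of the inputs not sampled.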

 



Thursday, October 10, 2013

Who knew? Complexity is free!


I learned just this month that 'complexity is free'. Who knew? Some years ago, it was quality that was free. Now, we've got the whole package: complexity and quality! And, all free.

Not so fast!

What we are talking about is the invasion of the agile refactor paradigm 'RGR' -- red, green, refactor -- into the pure hardware business, and the application of 'continuous integration' (CI) or perhaps even the hardware equivalent of X-unit testing.

And, what is the instrument of this invasion: the 3-D printer!  In this essay, we learn about how General Electric is applying 3-D printing to a hardware version of RGR-CI.  According to Luana Iorio, who oversees G.E.’s research on three-dimensional printing, here is what is going on:

Today ... engineers using three-dimensional, computer-aided design software now design the part on a computer screen. Then they transmit it to a 3-D printer ... Then, you immediately test it — four, five, six times in a day — and when it is just right you have your new part ... That's what [is meant] by complexity is free.
"The feedback loop is so short now," ... that "in a couple days you can have a concept, the design of the part, you get it made, you get it back and test whether it is valid" and "within a week you have it produced. ... It is getting us both better performance and speed."
Sound like agile? Certainly does to me! We've now moved to 'agile in the hardware' being a practical idea...


Monday, June 3, 2013

Testing COE and Agile



The question: Many organizations have created Testing Centers of Excellence. Is there a place for this in Agile or is this approach counter to the intimate nature of Agile? 

My answer: COEs usually are staffing pools, where the pool manager focuses on skill development and sends or assigns the skilled staff out to the dev teams. In this regard, the test person from the COE simply joins the dev team.

Sometimes, a COE is in the workflow of the project; in this scenario, the release package goes through the COE for some kind of validation testing before release to production.

Of course, agile is not a complete methodology if the field is not green (that is, if there is an existing product base); integration/UAT testing with the existing product base is often handled in a traditional sequence after agile development. The COE could be responsible for this last testing step.


Monday, March 25, 2013

Managing Technical Debt


Steve McConnell has a pretty nice video presentation on Managing Technical Debt that casts the whole issue in business-value terms. I like that approach since we always need some credible (to the business) justification for spending Other People's Money (OPM).

McConnell, of course, has a lot of bona fides to speak on this topic, having been a past editor of IEEE Software and having written a number of fine books, like "Code Complete".

In McConnell's view, technical debt arises for three reasons:
  1. Poor practice during development, usually unwitting
  2. Intentional shortcuts (take a risk now, pay me later)
  3. Strategic deferrals, to be addressed later
In this presentation, we hear about the risk-adjusted cost of debt (expected value), the opportunity cost of debt, the cost of debt service -- which Steve calls the debt service cost ratio (DSCR), a term taken from financial debt management -- and the present value of the cost of debt. In other words, there are a lot of ways to present this topic to the business, and a lot of ways to value the debt, whether it arises from point 1, 2, or 3 in the prior list.

One point well made is that technical debt often comes with an "interest payment". In other words, if the debt is not attended to and the object goes into production, then some effort will likely be needed constantly to address issues that arise -- the bug that keeps on bugging, as it were. To figure out the business value of "pay me now, pay me later", the so-called interest payments need to be factored in.

In this regard, a point well taken is that debt service may crowd out other initiatives, soaking up resources that could be more productively directed elsewhere. Thus, opportunity cost and debt service are related.
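
To make the "pay me now, pay me later" trade concrete, here is a back-of-the-envelope sketch with made-up numbers (not McConnell's): compare the one-time cost of retiring the debt now against the present value of the recurring "interest" if the debt is carried in production.

```python
# Back-of-the-envelope comparison; all figures are illustrative, not McConnell's.

def present_value(cash_flows, rate_per_period):
    """Discount a stream of future period costs back to today."""
    return sum(cf / (1 + rate_per_period) ** t
               for t, cf in enumerate(cash_flows, start=1))

fix_now_cost = 40_000          # hypothetical one-time remediation effort, in dollars
interest_per_quarter = 6_000   # hypothetical ongoing "interest": support, workarounds
quarters_carried = 12          # how long the debt would be carried in production
discount_rate = 0.02           # per quarter

carry_cost = present_value([interest_per_quarter] * quarters_carried, discount_rate)

print(f"Retire the debt now:  ${fix_now_cost:,.0f}")
print(f"Carry the debt (PV):  ${carry_cost:,.0f}")
print("Deferral looks cheaper" if carry_cost < fix_now_cost else "Fixing now looks cheaper",
      "on these numbers, before counting the opportunity cost of the crowded-out work.")
```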

Bottom line: carrying debt forward is not free, so even strategic deferrals come with a cost.




Thursday, October 25, 2012

SCRUM + Nine


In a paper supported by Microsoft and NC State University, we learn about SCRUM practices by three teams, and about nine practices that these teams applied as augmentation to SCRUM. The great thing about this paper is that it is well supported by metrics and a mountain of cited references, so it is not as "populist" as others.

To begin, the Microsoft authors describe SCRUM this way:
The Scrum methodology is an agile software development process that works as a project management wrapper around existing engineering practices to iteratively and incrementally develop software.
I like that description: "project management wrapper", since, unlike XP and other agile methodologies, SCRUM is almost exclusively a set of loosely coupled PM practices.

That said, we read on to learn about three teams, A, B, and C. We learn that story points live! Microsoftees like them (and so does Jeff Sutherland):

The Microsoft teams felt the use of Planning Poker enabled their team to have relatively low estimation error from the beginning of the project. Figure 1 [below] depicts the estimation error for Team A (the middle line) relative to the cone of uncertainty (the outer lines). The cone of uncertainty is a concept introduced by [Barry] Boehm and made prominent more recently by [Steve] McConnell based upon the idea that uncertainty decreases significantly as one obtains new knowledge as the project progresses. Team A’s estimation accuracy was relatively low starting from the first iteration. The team attributes their accuracy to the use of the Planning Poker practice.


And, what about the other 8 practices? The ones cited are:
  1. Continuous integration (CI) with Visual Studio (a Microsoft product)
  2. Unit TDD using the NUnit or JUnit frameworks
  3. Quality gates, defined as 1 or 0 on predefined 'done' criteria (a minimal sketch follows this list)
  4. Source control, again with Visual Studio
  5. Code coverage by test scripts followed the Microsoft Engineering Excellence recommendation of having 80% unit test coverage
  6. Peer reviews
  7. Static analysis of team metrics
  8. Documentation in XML
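
Items 3 and 5 above invite a concrete picture. Here is a minimal sketch of a binary quality gate over predefined 'done' criteria, including an 80% coverage bar; the criteria names and thresholds are my own illustration, not the paper's definitions.

```python
# Minimal sketch of a 1-or-0 quality gate over predefined 'done' criteria.
# Criteria and thresholds are illustrative, not taken from the paper.

DONE_CRITERIA = {
    "unit_test_coverage": lambda m: m["coverage"] >= 0.80,   # the 80% bar
    "all_tests_pass":     lambda m: m["failing_tests"] == 0,
    "static_analysis":    lambda m: m["blocker_findings"] == 0,
    "peer_reviewed":      lambda m: m["review_approvals"] >= 1,
}

def quality_gate(metrics):
    """Return 1 if every predefined criterion passes, else 0."""
    return int(all(check(metrics) for check in DONE_CRITERIA.values()))

build_metrics = {"coverage": 0.83, "failing_tests": 0,
                 "blocker_findings": 0, "review_approvals": 2}
print("gate =", quality_gate(build_metrics))   # 1 means 'done'; 0 means not done
```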
And what conclusion is drawn?

The three teams were compared to a benchmark of 10 defects per thousand lines of code (KLOC). Two of the three teams did substantially better than the benchmark (2 and 5) and one team did substantially worse (21). The latter team is reported (in the paper) to have scrimped on testing. Thus, we get this wise conclusion:
These results further back up our assertion on the importance of the engineering practices followed with Scrum (in this case more extensive testing) rather than the Scrum process itself.
Wow! That's a biggie: design-development-test practices matter more than the PM wrapper! We should all bear this in mind as we go about debating wrappers.

And, one more, about representativeness and applicability in other situations:
...our results are only valid in the context of these three teams and the results may not generalize beyond these three teams.


And, last, what about errors in cause and effect?
There could have been factors regarding expertise in the code base, which could have also contributed to these results. But considering the magnitude of improvement (250%), there would still have to be an improvement associated with Scrum even after taking into account any improvement due to experience acquisition.
And, need I mention this gem?



Saturday, March 17, 2012

Exploratory testing

James Bach writes about exploratory testing.

In his paper, "Exploratory Testing Explained", Bach says exploratory testing is misunderstood and abused as just putzing around with no real plan in mind. Yet, he claims it can be way more productive than scripted testing.  Perhaps.

Bach explains:




Exploratory testing is also known as ad hoc testing. Unfortunately, ad hoc is too often synonymous with sloppy and careless work. So, in the early 1990s a group of test methodologists (now calling themselves the Context-Driven School) began using the term “exploratory”, instead. With this new terminology, first published by Cem Kaner in his book "Testing Computer Software", they sought to emphasize the dominant thought process involved in unscripted testing, and to begin to develop the practice into a teachable discipline.

Exploratory testing is simultaneous learning, test design, and test execution.

This is a general lesson about puzzles: the puzzle changes the puzzling. The specifics of the puzzle, as they emerge through the process of solving that puzzle, affect our tactics for solving it. This truth is at the heart of any exploratory investigation, be it for testing, development, or even scientific research or detective work.

I think this paragraph sums it up:

The external structure of ET is easy enough to describe. Over a period of time, a tester interacts with a product to fulfill a testing mission, and reporting results. There you have the basic external elements of ET: time, tester, product, mission, and reporting. The mission is fulfilled through a continuous cycle of aligning ourselves to the mission, conceiving questions about the product that if answered would also allow us to satisfy our mission, designing tests to answer those questions, and executing tests to get the answers.



Saturday, February 4, 2012

Defect tracker for Agile?

Tools linked to process, and process linked to tools, are always grist for debate in the agile space. Defect tracking is one of those processes that begs the question: to engage with a tool or not? Many say: put the defect on a card, put the card on a wall in the war room, and work it into the backlog as a form of requirement that is commonly labeled "technical debt".

 
For sure, that's one way to handle it. Of course, the cards are perishable. Many say: so what? Once fixed, there's no need to clutter things up with a bunch of resolved fixes.

 
Lisa Crispin and Janet Gregory, writing in their book "Agile Testing: A Practical Guide for Testers and Agile Teams", have a few other ideas. From their experience, there are these reasons to use an automated tool to capture and retain trouble reports:

 
  • Convenience: "If you are relying only on cards, you also need conversation. But with conversation, details get lost, and sometimes a tester forgets exactly what was done—especially if the bug was found a few days prior to the programmer tackling the issue."

  • Knowledge base: Probably the only reason to keep a knowledge base is for the intermittent problems that may take a long time and a lot of context to work out. The tracker can keep notes about observations and test conditions.

  • Large or distributed teams: It's all about accurate communications. A large or distributed team can not use a physical war room that's in one place

  • Customer support: If a customer reports the defect, they're going to expect to be able to get accurate status. Hard to do with a card hanging on the wall if the customer isn't physically present. 

  • Metrics: Agile depends on benchmarks to keep team velocity current and up to date.

  • Traceability: It's always nice to know if a particular test case led to a lot of defects. Obviously, many defects will not come from a specific test; they'll be found by users. But it never hurts to know.

Of course, there are a few reasons to be wary of a database-driven DTS tool. Number one on my list is probably one that makes everyone's list:

  • Communications: It's not a good idea to communicate via notes in the DTS. Communication belongs first face-to-face, and then in email or text if a record is needed. The DTS is for logging, not a substitute for getting up and talking to the counter-party.

  • Lean: All tools have a bit of overhead that does not contribute directly to output. Thus, maximum lean may mean minimum tooling.

Bottom line: Use the tool! I've always had good results using tracking tools. It just takes a bit of discipline to make them almost transparent day-to-day.

Sunday, October 9, 2011

The forest for the trees

We are all taught from the first moment that the way to eat an elephant is one bite at a time. That is: to work on a complex problem, always start by simplifying the task by decomposition, disaggregation, and separation of the trees from the forest.

Ok, but what about this: (think: object = tree; distant background = forest)


In a posting on "Azimuth", John Baez has a Part I discussion of evolution in complex systems, from which the diagram, above, is taken.

In addition to the distortion arising from the point of view from which data is viewed, he goes on to caution about the Dunning-Kruger effect (the uninformed, misinformed, and ignorant are biased to not understand their own cognitive deficiencies), saying, in part:

... if we don’t understand a system well from the start, we may overestimate how well we understand the limitations inherent to the simplifications we employ in studying it.

But, it's not only trees: it's also use cases and user stories, and then ultimately Test-Driven-Development scripts, all decompositions that have the potential to alter perspective.

Thus, constant attention to "re-composition" to validate low-level requirements against the high-level vision is necessary. That is the essence of the "V-Model", a bit of system engineering that we can all take advantage of.


Wednesday, September 23, 2009

Test Driven Development -- TDD

New to the agile scene? Curious about XP -- Extreme Programming? One practice born in XP and now widely dispersed in the agile community is Test Driven Development, TDD.


The main idea
Here are the main ideas of TDD, and they are a mind-bender for the traditionalist: Requirements are documented in the form of test scripts, and test scripts are the beginning point for product design.


TDD works this way: detailed design and development, after consideration for architecture, is a matter of three big steps:

Step 1: document development requirements with test scripts. Beginning with functional requirements in the iteration backlog, and in collaboration with users, the developer writes technical design requirements in the form of a test script, and writes the script in a form to be run with test-automating tools. Automated tools enable quick turnaround. Automated tests run fast, run on-demand, run repeatedly the same way, and run under various data and system conditions.


Step 2: run the test, modifying the object design until it passes. If the test fails, as it often will, the developer implements the quickest and simplest solution that will likely pass. Step 2 is repeated until the test passes.


Step 3: refine the detail design of the object. After the solution passes the test, the developer iterates the detail design to be compliant with quality standards. In doing so, only the internal detail is changed and improved; no change is made to the object’s external characteristics. To verify no external changes and continued functionality, the modified solution is retested internally.
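
The three steps are easier to see in code. Here is a minimal, hypothetical illustration using Python's unittest: the "requirement" (a shipping-fee rule I invented for the example) is written as a test first, the quickest passing implementation comes second, and the refactored version, with the same external behavior, comes last.

```python
import unittest

# Step 1: the requirement is documented as a test script, written first.
class TestShippingFee(unittest.TestCase):
    def test_free_shipping_at_or_over_threshold(self):
        self.assertEqual(shipping_fee(order_total=120.0), 0.0)

    def test_flat_fee_under_threshold(self):
        self.assertEqual(shipping_fee(order_total=30.0), 5.99)

# Step 2: the quickest, simplest code that makes the tests pass might be:
#     def shipping_fee(order_total):
#         return 0.0 if order_total >= 100.0 else 5.99

# Step 3: refactor the internals (named constants instead of magic numbers)
# without changing the external behavior, then rerun the same tests.
FREE_SHIPPING_THRESHOLD = 100.0
FLAT_FEE = 5.99

def shipping_fee(order_total: float) -> float:
    """Fee rule under test; external behavior identical to the Step 2 version."""
    return 0.0 if order_total >= FREE_SHIPPING_THRESHOLD else FLAT_FEE

if __name__ == "__main__":
    unittest.main()   # automated, repeatable, on demand: the point of the tooling
```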


A project management tip: TDD, unit tests, and acceptance tests
  • TDD was invented as a design practice; it is commonly applied to the lowest level design units
  • “TDD's origins were a desire to get strong automatic regression testing that supported evolutionary design.”4
  • Unit tests, distinct from TDD tests, are postimplementation design verification tests. In most respects, unit tests and TDD tests are very similar, but the purposes are very different: one drives design and the other verifies implementation.
  • The concept of automated tests is also applicable to higher-level tests, such as product integration tests and user acceptance tests, but these are not for the purpose of driving design.
  • At higher levels, designs that pass unit tests are being tested in a larger context and by independent testers, including end users.

