The dark side of Tri-State Locking

There is a great song for that (ah. My wife loves that song. And so do I): Poison – Every Rose Has Its Thorn (Official Music Video) (youtube.com)

Meanwhile listening to this rock ballad, when the drum kicks in heavily, try to imagine a Dynamics 365 Business Central that is rolling on a procedure and suddenly… CRASH. (a client crash). And to make this even worse, it happens randomly, even tough limited to specific custom procedures and data driven.

This is the footprint left in the Application log (event viewer). Ah yes, it is on-premises – this is secondary to the investigation, tough-:

Server instance: FOO

Category: Sql

ClientSessionId: 96b72502-0508-4489-bee5-8d4fc7b083f1

ClientActivityId: 550ba516-ddd0-57b5-ec80-0c70f22923fe

ServerSessionUniqueId: e87e6511-9811-461f-9860-7493556612aa

ServerActivityId: cde41795-d60e-4d42-bcf9-22535d6b36ff

EventTime: 05/20/2024 09:12:31

Message (Message 1 of 2) (NavSqlException): ParentException: NavSqlException

Errore di SQL imprevisto:

No more lock classes available from transaction.

Istruzione SQL:

SELECT  TOP (@0) "5406"."timestamp","5406"."Status","5406"."Prod_ Order No_","5406"."Line No_","5406"."Item No_","5406"."Variant Code","5406"."Location Code","5406"."Quantity","5406"."Scrap _","5406"."Production BOM No_","5406"."Routing No_","5406"."Routing Reference No_","5406"."Quantity (Base)","5406_ext"."EOS076 Sales Order No_$4cd45ea8-7d4c-4d5a-9677-5ff196cb9550",DATALENGTH("5406_ext"."PLF Glue Family Image$eec9ddee-3955-4b90-a46c-5e5b45ebeda0"),DATALENGTH("5406_ext"."PLF Glue 2 Family Image$eec9ddee-3955-4b90-a46c-5e5b45ebeda0"),"5406"."$systemId","5406"."$systemCreatedAt","5406"."$systemCreatedBy","5406"."$systemModifiedAt","5406"."$systemModifiedBy" FROM "FOO".dbo."FooCompany$Prod_ Order Line$437dbf0e-84ff-417a-965d-ed2bb9650972" "5406"  WITH(READCOMMITTED)  JOIN "FOO".dbo."FooCompany$Prod_ Order Line$437dbf0e-84ff-417a-965d-ed2bb9650972$ext" "5406_ext"  WITH(READCOMMITTED)  ON ("5406"."Status" = "5406_ext"."Status") AND ("5406"."Prod_ Order No_" = "5406_ext"."Prod_ Order No_") AND ("5406"."Line No_" = "5406_ext"."Line No_") WHERE ("5406"."Status"=@1 AND "5406"."Prod_ Order No_"=@2 AND "5406"."Routing No_"=@3 AND "5406"."Routing Reference No_"=@4) ORDER BY "Status" ASC,"Prod_ Order No_" ASC,"Line No_" ASC OPTION(FAST 50)

StackTrace:

   at System.Environment.get_StackTrace()

   at Microsoft.Dynamics.Nav.Runtime.NavSqlConnection.ThrowNavSqlException(SqlExceptionAdapter exceptionAdapter, Boolean transactionNoLongerValid, String commandText, Boolean isLastExceptionARollbackCause, Boolean logExceptions, NavCancellationToken cancellationToken)

   at Microsoft.Dynamics.Nav.Runtime.NavSqlConnection.MapException(SqlExceptionAdapter exceptionAdapter, String commandText, Boolean isRollbackAction, Nullable`1 timeout, NavCancellationToken cancellationToken)

   at Microsoft.Dynamics.Nav.Runtime.NavSqlConnection.<>c__DisplayClass137_0`1.<ExecuteFunction>b__0()

What the heck is? And why?

No more lock classes available from transaction.

Initially, I thought there were an excessive AL recursion, and my thought was that lock classes were full before the maximum AL recursion allowed was reached. Therefore, I have setup AL recursion to a very low value (4) and restarted NSTs, but the problem persisted.

Three more elements were added on the plate.

  1. Environment was just updated from 21.5 to 23.6.
  2. The features where this crash started were working beautifully before the upgrade took place.
  3. Tri-state locking enabled at database and application level (see Tri-state of mind: unvealing the real power of READ COMMITTED isolation level.).

Looking at the SQL Server query, you could clearly see the WITH(READCOMMITTED) hint.

So, the next step was to turn off that feature and fall back to REPEATABLEREAD: no more client crash.

This problem is generated, then, by the enablement of tri-state locking feature (and Read Committed Snapshot Isolation for the database). Also highlighted through a SQL Server profiler trace exception

Followed by a burp-in-the-face User Error Message : Uncommittable transaction is detected at the end of the batch. The transaction is rolled back.

Code is (was) quite complex and certainly it might be rewritten “a bit” more backend savvy. An example is the usage – that I personally dislike – of Mark/Markedonly and other super-filtered artifacts that generate loooooooooooooong query text like the following.

I know that it could be a good Pop Art painting to be placed over the sofa but – hell – for real queries in production, no thanks.

How to (quickly) resolve this problem?

TEMPORARY HEIGHTENING

In the medium-term, refactoring code is absolutely a need here, everyone agrees on that. But if you are in the business since a while now, you know that you need to unblock production environments and this has to be done quickly.

Disable tri-state locking for the entire environment could be an option. But this is like a chop block that cut legs to one of the best performance features introduced since years (if not decades) for Dynamics 365 Business Central.

If you look at this video from product group about isolation levels

BC TechDays 2023 – Locking in AL: Runtime and explicit AL control (youtube.com)

you have a deep dive into different way to use isolation levels and what temporary lowering and heightening means, when and how to use them. Below an example:

In the case exposed above, the application will run the reader query (FindLast) using WITH(UPDLOCK) hint, hence heightening the default value (READUNCOMMITTED) to a higher locking one – actually the highest – .

We did the same for this case, and applied a different isolation level. Falling back to a higher locking scenario, similar to what it was before the upgrade (FYI, it should have been UPDLOCK).

In shorts, Tri-state locking has been left enabled while for that specific procedure we have applied a temporary heightening from READCOMMITTED to REPEATABLEREAD. In other words, we simply try to mimic a higher locking scenario, similar to what was running back in 21.5 original version. Something like the following:

OldStartingDateTime := ProdOrderRtngLine2."Starting Date-Time";
ProdOrderRtngLine2."Routing Status" := ProdOrderRtngLine2."Routing Status"::Planned;
ProdOrderRtngLine2."Starting Date" := DT2Date(StartingDateTime);
ProdOrderRtngLine2."Starting Time" := DT2Time(StartingDateTime);
PlanningMgt.FindNextStartingDateTime(ProdOrderRtngLine2);
ProdOrderRtngLine2.Modify();

ProdOrderRtngLine2.ReadIsolation := IsolationLevel::RepeatableRead;
ProdOrderRtngLine2.Validate(ProdOrderRtngLine2."Starting Date");

ProdOrderRtngLine2.Get(ProdOrderRtngLine2.Status, ProdOrderRtngLine2."Prod. Order No.", ProdOrderRtngLine2."Routing Reference No.", ProdOrderRtngLine2."Routing No.", ProdOrderRtngLine2."Operation No.");
UpdateRoutingDateTime(ProdOrderRtngLine2);

ProdOrderRtngLine2.Get(ProdOrderRtngLine2.Status, ProdOrderRtngLine2."Prod. Order No.", ProdOrderRtngLine2."Routing Reference No.", ProdOrderRtngLine2."Routing No.", ProdOrderRtngLine2."Operation No.");
UpdateRoutingDateTime(ProdOrderRtngLine2);

and everything back to work smoothly. Finger snap.

Meanwhile rewriting and carefully testing the beastie, of course.

CONCLUSION

When deciding to enable tri-state locking (and RCSI) be sure to test everything carefully. I know that something might slip out or could be very specific and tightened to a particular procedure.

When this happens, if you are under pressure and you know that the issue is due to this feature enablement, consider refactoring your code applying a different isolation level, depending on the code flow.

Typically, a temporary heightening might be needed to fall back from READCOMMITTED (with RCSI enabled) to REPEATABLEREAD. Similar to what it was before enabling tri-state locking.

This should be able to unblock users and give more time for code review, refactoring and implement suitable and durable tests.

Leave a comment

Blog at WordPress.com.

Up ↑