The Myers diff algorithm: part 1 (jcoglan.com)
148 points by frankjr on March 13, 2024 | 23 comments


Good to see this interesting and very fundamental topic getting attention, and looking forward to the next post, where hopefully the algorithm's cleverness will be explored.

>It does this by being greedy, that is trying to consume as many lines that are the same before making a change (therefore avoiding the “wrong end” problem), and also by preferring deletions over insertions when given a choice, so that deletions appear first.

Not really. The algorithm guarantees to find an alignment/diff/edit script of minimal length, but whether it prefers the "leftmost" such alignment ("wrong-end" here and in many similar cases in source code) or the "rightmost" ("right-end" here) is not a crucial aspect of how it works. The preference can be switched by reversing both inputs before diffing and then reversing the output afterwards -- but that's also the case for any other algorithm that solves this problem optimally.
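
To make that concrete, the reverse trick can be wrapped around any diff routine. A quick Python sketch using difflib (which isn't Myers and doesn't guarantee minimality, but the index bookkeeping is the same for any sequence diff):

    from difflib import SequenceMatcher

    def rightmost_opcodes(a, b):
        # Diff the reversed inputs, then map the opcode indices back,
        # which flips whatever end-preference the underlying matcher has.
        ops = SequenceMatcher(None, a[::-1], b[::-1]).get_opcodes()
        flipped = [(tag, len(a) - i2, len(a) - i1, len(b) - j2, len(b) - j1)
                   for tag, i1, i2, j1, j2 in ops]
        return flipped[::-1]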

Myers's algorithm is greedy in another sense, however: It greedily hunts first for a solution in which no edits are necessary, then solutions in which 1 edit (insertion or deletion) is necessary, then 2, etc., terminating as soon as it finds a solution. Clearly this leads to an optimal (smallest-possible) diff. It also means that it takes time proportional to the amount of difference between the two inputs (O((N+M)D), where D is the number of edits), which is the clever part: This is what makes it much faster than other approaches in the common case where the two inputs are highly similar.
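
A bare-bones sketch of that outer loop (Python, my own paraphrase of the standard formulation; it only computes the number of edits D, with no reconstruction of the actual edit script):

    def myers_distance(a, b):
        # Greedy outer loop: try D = 0, 1, 2, ... edits and stop at the
        # first D for which the ends of both sequences are reachable.
        n, m = len(a), len(b)
        max_d = n + m
        off = max_d                    # offset so diagonal k can be negative
        v = [0] * (2 * max_d + 2)      # v[off + k] = furthest x reached on diagonal k = x - y
        for d in range(max_d + 1):
            for k in range(-d, d + 1, 2):
                # Move down (insertion) or right (deletion), taking the
                # furthest-reaching path from a neighbouring diagonal.
                if k == -d or (k != d and v[off + k - 1] < v[off + k + 1]):
                    x = v[off + k + 1]          # insertion, from diagonal k+1
                else:
                    x = v[off + k - 1] + 1      # deletion, from diagonal k-1
                y = x - k
                # Follow the "snake": matching lines are free.
                while x < n and y < m and a[x] == b[y]:
                    x, y = x + 1, y + 1
                v[off + k] = x
                if x >= n and y >= m:
                    return d                    # a minimal edit script has d edits
        return max_d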


It was posted in 2017. You can get to part 2 from the link at the bottom.


But there's also the not-so-common (yet still somewhat frequent) case where the inputs are highly different. Myers diff alone would take quadratic time, which in practice is often unacceptable.

Real-world implementations often have a limit on the amount of difference (e.g. max(256, sqrt(input size))), and if that's exceeded, they switch to heuristics (so the resulting diff is no longer guaranteed to be minimal). This reduces the worst-case runtime from O(N^2) to O(N^1.5).
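
In terms of the sketch above, the cutoff is just a cap on the greedy D loop. A hypothetical version (the max(256, sqrt(...)) figure is only the example number from this comment, and the fallback here is deliberately the crudest one possible):

    import math

    def bounded_myers(a, b, limit):
        # Same greedy D-loop as the sketch above, but give up once D
        # exceeds `limit` instead of searching all the way to n + m.
        n, m = len(a), len(b)
        off = n + m
        v = [0] * (2 * off + 2)
        for d in range(min(limit, n + m) + 1):
            for k in range(-d, d + 1, 2):
                if k == -d or (k != d and v[off + k - 1] < v[off + k + 1]):
                    x = v[off + k + 1]
                else:
                    x = v[off + k - 1] + 1
                y = x - k
                while x < n and y < m and a[x] == b[y]:
                    x, y = x + 1, y + 1
                v[off + k] = x
                if x >= n and y >= m:
                    return d
        return None                    # cap exceeded

    def diff_cost(a, b):
        limit = max(256, int(math.sqrt(len(a) + len(b))))
        d = bounded_myers(a, b, limit)
        # Fallback when the minimal search is too expensive: treat the
        # whole thing as "delete all of a, insert all of b" (a real
        # implementation would do something smarter, e.g. anchor on
        # unique common lines).
        return d if d is not None else len(a) + len(b)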


Here's how we implemented the Myers diff algorithm in VS Code: https://www.youtube.com/live/yWy-0TNVsLg

On top of the core algorithm, we have many heuristics to show much nicer diffs in many cases.


> we have many heuristics to show much nicer diffs in many cases

Anything related to the “patience” or “histogram” variations as used e.g. in Git?


Also interested in this.


A while ago I discovered the lesser known Tichy diff algorithm that seems to preserve more context and is better suited for big code refactors: https://www.researchgate.net/publication/220439403_The_Strin... via https://bryanpendleton.blogspot.com/2010/04/more-study-of-di...


Any current implementations?


“Tichy diff” pulls up a helluva lot of C. difficile links on DDG. Sheesh.


I was expecting this guy to come up actually: https://en.wikipedia.org/wiki/Ijon_Tichy


> C. difficile

Difficult C? :-)


Runny C.

> Clostridium difficile (C. diff) is a type of bacteria that can cause diarrhoea

https://www.nhs.uk/conditions/c-difficile/


Think of it as a highly runtime optimized diff.


It’s not diarrhea, it’s life changing diarrhea. People get hospitalized for that shit (pun not intended)


I used to use diff as an interview question. Implementing a simple algorithm that produces reasonable (not necessarily minimal) diffs is fairly straightforward. If you're on track for an ok-to-good score in implementing diff, it leaves just the right amount of time to have a rich discussion about optimizing and other approaches.
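
For example, a "reasonable but not minimal" diff can be as small as trimming the common prefix and suffix and emitting everything in between as one hunk (an illustrative Python sketch, not the actual interview exercise):

    def simple_diff(a, b):
        # Strip the common prefix and suffix, then report the middle as
        # a single delete-then-insert hunk. Reasonable output, O(n) time,
        # but nowhere near minimal on interleaved changes.
        n, m = len(a), len(b)
        pre = 0
        while pre < n and pre < m and a[pre] == b[pre]:
            pre += 1
        suf = 0
        while suf < n - pre and suf < m - pre and a[n - 1 - suf] == b[m - 1 - suf]:
            suf += 1
        ops = [('-', line) for line in a[pre:n - suf]]
        ops += [('+', line) for line in b[pre:m - suf]]
        return ops

From there the conversation about minimality, LCS, and Myers falls out naturally.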

I feel like this makes the interview a balanced test for programming capability, problem solving and communication. I tested it quite a bit on folks I knew and it had some of the lowest rates of false positives. The "candidates" enjoyed it and felt they solved something much more interesting to them, leaving a memory they carry through their software engineering lives. The ones I tested it on have actually mentioned it in future conversations.

Interviews today are too focused on maximizing the time spent coding. With this approach, verifying the ability to code is quick, and we can spend more time improving the solution and using conversation to test the higher-order bits.


I really like these sorts of interview questions. Something with a spectrum of solutions, interesting algorithms, and room for discussion and exploration.

I also like asking interviewees questions that are intentionally missing some amount of detail. Nothing that makes the problem a dead end, but enough vagueness to give people a chance to ask clarifying questions and/or make assumptions of their own.


Line-based diff algorithms are fine, but what the world really needs is more study into tree-based diffs.

This does not require per-language implementations of `diff`, only per-language implementations of text-to-some-well-known-tree.
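
As a toy illustration of the idea (nothing to do with any real tree-diff tool), here the per-language part is just Python's own parser, and the generic part is a naive recursive comparison that reports the smallest differing subtrees:

    import ast

    def tree_diff(old_src, new_src):
        # "text -> well-known tree" is ast.parse; the tree comparison itself
        # is language-agnostic (and deliberately naive here: no move/rename
        # detection, lists compared positionally).
        def walk(a, b, path):
            if type(a) is not type(b):
                yield (path, a, b)
            elif isinstance(a, ast.AST):
                for field in a._fields:
                    yield from walk(getattr(a, field, None), getattr(b, field, None),
                                    path + [f"{type(a).__name__}.{field}"])
            elif isinstance(a, list):
                if len(a) != len(b):
                    yield (path, a, b)
                else:
                    for i, (x, y) in enumerate(zip(a, b)):
                        yield from walk(x, y, path + [f"[{i}]"])
            elif a != b:
                yield (path, a, b)
        return list(walk(ast.parse(old_src), ast.parse(new_src), []))

    # tree_diff("x = 1 + 2", "x = 1 + 3") points at the changed constant,
    # not at the whole line.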


I agree. Semantic diff is very useful. Tangentially, I'm surprised that the terraform diff cannot be displayed in a more user-friendly manner.


Myers also co-created the BLAST algorithm, which was widely used in biology (still is, but it was used more in the past).


And Gene wrote the assembly tools that did shotgun assembly of the human genome for Celera, when most folks (except Jim Kent, who wrote the _other_ assembler, used by the public sequencing effort: NIH, Broad, UCSC, etc…) said it couldn’t be done. IMO, he and Jim Kent deserve a Nobel prize for these efforts.


True. Celera had a large TruCluster: machines with 8 (16?) processors, 64GB of memory, and inter-node cluster networking for a truly shared filesystem. Shotgun assembly at the time required large-memory, many-CPU, high-throughput-IO machines (although 64GB isn't really a large-memory machine now), while Kent's approach worked fine on clusters but IIRC was tuned for the specific type of scaffold sequencing done by the public project.

From the original Celera paper, an endnote describing what was pretty impressive hardware for the time:

Celera’s computing environment is based on Compaq Computer Corporation’s Alpha system technology running the Tru64 Unix operating system. Celera uses these Alphas as Data Servers and as nodes in a Virtual Compute Farm, all of which are connected to a fully switched network operating at Fast Ethernet speed (for the VCF) and gigabit Ethernet speed (for data servers). Load balancing and scheduling software manages the submission and execution of jobs, based on central processing unit (CPU) speed, memory requirements, and priority. The Virtual Compute Farm is composed of 440 Alpha CPUs, which includes model EV6 running at a clock speed of 400 MHz and EV67 running at 667 MHz. Available memory on these systems ranges from 2 GB to 8 GB. The VCF is used to manage trace file processing, and annotation. Genome assembly was performed on a GS 160 running 16 EV67s (667MHz) and 64 GB of memory, and 10 ES40s running 4 EV6s (500 MHz) and 32 GB of memory. A total of 100 terabytes of physical disk storage was included in a Storage Area Network that was available to systems across the environment. To ensure high availability, file and database servers were configured as 4-node Alpha TruClusters, so that services would fail over in the event of hardware or software failure. Data availability was further enhanced by using hardware- and software-based disk mirroring (RAID-0), disk striping (RAID-1), and disk striping with parity (RAID-5).


What's more, this isn't history. Code from the Celera assembler lives on in a lineage of assembly methods (Canu, HiCanu, Verkko) which have ultimately _completely automated the process of complete genome assembly_ https://doi.org/10.1038/s41587-023-01662-6. The fact that this assembly approach remained relevant until practical resolution of the assembly problem is a testament to its solid theoretical foundation (the string graph) which relates read length, error rate, and information theoretic limits of genome assembly https://doi.org/10.1093/bioinformatics/bti1114.


This talk by Gene on developing it is such a great watch:

https://youtu.be/pVFX3V0Q2Rg?si=vF-Z5dOd0IqgPFtP

So many innovations have come from realizing that you'll be able to do it in software if you just wait a few years.



