Sunday, August 5, 2012

Gsoc: Patch Index Week 11

This week, I added support for user interrupts during patch index creation,
cleaned up the timing and correctness scripts, ran them on the xmonad
repository, and extended and cleaned up the tests and documentation.

User Interrupt
 The user can interrupt darcs get with Ctrl-C; the remaining patches are then downloaded only when needed. However, the entire set of patches is required to build the patch index. Now, when the user interrupts get with Ctrl-C, patch index is disabled and get stops immediately.
 Patch index is created automatically when a darcs with patch index support is run on an existing repository. The user can now interrupt this with Ctrl-C as well: patch index is disabled, and the command continues.
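 The interrupt behaviour can be sketched roughly as follows. This is a small Python illustration of the idea, not the Haskell code in darcs; build_index, IndexState and interrupt_at are hypothetical names, and patches are plain strings.

```python
from enum import Enum


class IndexState(Enum):
    ENABLED = "enabled"
    DISABLED = "disabled"


def build_index(patches, on_patch=None):
    """Hypothetical index builder.

    Processes every patch, or gives up and disables the index if the user
    interrupts (Ctrl-C raises KeyboardInterrupt in Python).
    """
    index = {}
    try:
        for number, patch in enumerate(patches):
            if on_patch is not None:
                on_patch(patch)  # hook, used below to simulate an interrupt
            index[patch] = number
    except KeyboardInterrupt:
        # A partial index would give wrong answers, so discard it and
        # disable the feature; the calling command carries on without it.
        return IndexState.DISABLED, {}
    return IndexState.ENABLED, index


def interrupt_at(target):
    """Return a hook that simulates Ctrl-C when `target` is reached."""
    def hook(patch):
        if patch == target:
            raise KeyboardInterrupt
    return hook
```

 Calling build_index(["p1", "p2", "p3"], on_patch=interrupt_at("p2")) returns the disabled state with an empty index, while an uninterrupted call returns the enabled state with every patch indexed.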

Timing and Correctness Scripts
 I have written bash timing and correctness scripts. The files are in the contrib folder of my darcs patch index repository.
 patch-index-correctness.sh REPO_URL gets the darcs repository at REPO_URL and compares the output of changes and annotate, with and without patch index, for every file in the directory.
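 The comparison loop of such a correctness check might look like the sketch below. It is written in Python for illustration and is not the bash script itself; with_index and without_index are hypothetical caller-supplied functions standing in for running changes or annotate with and without patch index.

```python
import os


def tracked_files(root):
    """Yield regular files under root (relative paths), skipping _darcs."""
    for dirpath, dirnames, filenames in os.walk(root):
        # _darcs holds darcs metadata, not working-copy files.
        dirnames[:] = [d for d in dirnames if d != "_darcs"]
        for name in sorted(filenames):
            yield os.path.relpath(os.path.join(dirpath, name), root)


def mismatches(root, with_index, without_index):
    """Return the files for which the two implementations disagree.

    Both arguments map a relative path to the command output for that
    file (hypothetical stand-ins for the real darcs invocations).
    """
    return [path for path in tracked_files(root)
            if with_index(path) != without_index(path)]
```

 An empty result means the two versions agree on every file, which is what "the correctness test passed" refers to.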
 patch-index-timing.sh REPO_URL DEV_FILE MOUNT_POINT mounts the partition DEV_FILE at MOUNT_POINT, gets the repository at REPO_URL there, and measures the time taken by changes and annotate, with and without patch index, both with and without clearing the disk cache. This script needs to be run as root.
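 A hot versus cold measurement can be sketched like this in Python. It is only an illustration: the post does not say how the script clears the cache, so writing to /proc/sys/vm/drop_caches (Linux only, root required) is an assumption, and time_command and drop_disk_cache are hypothetical names.

```python
import subprocess
import time


def drop_disk_cache():
    """Clear the Linux page cache (assumed mechanism; needs root)."""
    subprocess.run(["sync"], check=True)
    with open("/proc/sys/vm/drop_caches", "w") as handle:
        handle.write("3\n")


def time_command(argv, cold=False):
    """Return wall-clock seconds taken by argv.

    With cold=True the disk cache is dropped first, so the command has
    to read everything from disk again.
    """
    if cold:
        drop_disk_cache()
    start = time.perf_counter()
    subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start
```

 Timing the same command once with cold=True and once more immediately afterwards gives the two extremes reported in the results.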
 I have run these scripts on the xmonad repository. The correctness test passed, and the results of the timing tests are here: changes, and annotate. On average, annotate is 63% (cold cache) to 74% (hot cache) faster, and changes is 45% (cold cache) to 78% (hot cache) faster. These gains are smaller than on the darcs repository (changes 78-92% faster, annotate 86-94% faster), mostly because the darcs repository has more patches; a larger repository gets larger gains. I am going to run the timing tests on the ghc and agda repositories this week.
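 To compare these percentages with speed-up factors, the conversion can be sketched as below. It assumes that "N% faster" means N% of the running time is saved, which matches the wording of the week 10 report; speedup_factor is a hypothetical helper.

```python
def speedup_factor(percent_saved):
    """Convert 'percent of time saved' into a speed-up factor.

    Saving p% of the time leaves (100 - p)% of it, so the command runs
    100 / (100 - p) times as fast.
    """
    if not 0 <= percent_saved < 100:
        raise ValueError("percent_saved must be in [0, 100)")
    return 100.0 / (100.0 - percent_saved)


# Ranges quoted in the text, as (label, cold or low end, hot or high end).
RANGES = [
    ("xmonad annotate", 63, 74),
    ("xmonad changes", 45, 78),
    ("darcs changes", 78, 92),
    ("darcs annotate", 86, 94),
]

for label, low, high in RANGES:
    print(f"{label}: {speedup_factor(low):.1f}x to {speedup_factor(high):.1f}x")
```

 Under that reading, saving 92% of the time corresponds to a 12.5x speed-up, so the percentages grow slowly while the factors grow quickly near the top of the range.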

 Next week I will finish the documentation, and elaborate on and add more tests.
 My darcs repository is here.

Sunday, July 29, 2012

Gsoc: Patch Index Week 10

 This week, I improved the tests and documentation on patch index structures, completed the timing tests on changes and annotate, and made some UI fixes.

 Patch index structures
 Patch index maintains four data structures. I have written a subcommand to test the logical consistency of the structures. Invoke it using "darcs show patch-index-test". This subcommand can run on any darcs repository with patch index. I have tested darcs development repository and a few test repositories using this command, and found that it passed all tests.
 I have expanded the shell test on patch index structures. The shell test creates a new darcs repository, and uses the additional knowledge of the exact steps involved to test the structures more thoroughly.
 As per Eric's suggestion, I have added examples of the contents of these structures to the documentation.

Timing Tests
  I have completed the timing tests of changes and annotate. The time taken by changes or annotate varies according to the contents of the disk cache of the operating system. I have measured both extremes of the time taken, with and without using patch index. On average, patch index saves 78%-92% of time taken on changes, and  86%-94% on annotate.
 The exact results for changes and annotate are here.

UI Fixes
  I have added support for Ctrl-C during get. Now, if you stop getting the patches in the middle of get, patch index will be disabled.
 I have renamed --no-patch-index to the more descriptive --disable-patch-index.

Next week, I will:
  1. Port the latest changes to Eric's newest rebase: Eric has just rebased the patch index repository. However, I have made a few new changes since then. I will port the new changes and start using the rebase.
  2. Support Ctrl-C at patch index creation on existing repositories: Patch index is created automatically when a darcs with patch index support is run on an existing repository. I intend to allow the user to interrupt this in the middle if they wish to disable patch index.
  3. Fill in the missing Haddock documentation for some patch index functions.
  4. Document the various tests on patch index. 
The latest patch index repository is at: http://darcsden.com/bsrkaditya/darcs-patch-index

Sunday, July 22, 2012

Gsoc: Patch Index Week 9


 This week I wrote a patch to fix annotate in screened, ran timing tests with patch index on clean disk caches, wrote a shell test on patch index internal structures, and made some changes to the UI.

 Annotate on directories

  When annotate is applied to a directory, it outputs details of the last modifying patch for each file in the directory.
 However, annotate fails on some files and reports "unknown" as the patch. I have written a fix and ported it to the darcs development repository. The patch was reviewed and accepted.

 UI changes

  I have simplified the UI by removing the single-instance disable. The UI now stands as follows:
  1. Enable or disable patch index using optimize --patch-index or optimize --no-patch-index.
  2. Patch index is used and updated whenever it is enabled.
  3. Patch index is enabled automatically at repository creation. If you wish to disable patch index, pass --no-patch-index.
  4. When a darcs with patch index support is run on an existing repository, patch index is created automatically. You can preemptively disable patch index using optimize --no-patch-index.
 I have made a post in darcs users, suggesting this UI.

 Timing tests


  The time darcs takes to run a command varies with the contents of the disk cache. In practice, this means that the first few commands in a typical session are slower. I have written tests which measure the time taken by patch index changes/annotate with a cleared disk cache. These are the results for changes and annotate. While the data for changes and annotate without patch index is still incomplete, the results suggest that patch index improves the speed of annotate and changes regardless of the state of the disk cache. However, there is a concern about the time it takes to create patch index with a cleared disk cache (around 1 min instead of 6 sec).

 Patch index data structures tests


  Patch index stores information about the repository on the disk. I have begun writing shell tests on patch index internal structures.


 Next week, I plan to
  1. Write more tests on patch index data structures. There are two kinds of tests that can be written:
    • Testing if the patch index structures uphold some properties. These tests can be run on any arbitrary repository.
    • Create a small(but complicated) repository, and check if the information stored is valid
  2. More documentation on patch index structures
  3. Documentation on the timing tests

Sunday, July 15, 2012

Gsoc: Patch Index Week 8

  This week, I worked on allowing a variable default for patch index, and on making the failing tests of the darcs development repository pass.

Variable default
 Previously, patch index was created/updated/used whenever the opportunity arose. You could disable this behavior and not create/update/use patch index, but only for that instance of the command (by passing --no-patch-index). Now, you can set the default to not use patch index.
 There are three possible states a repository can be in:

  1. patch index usage is default: A repository is created in this state by default. You can also set the repository to this state manually by running optimize --patch-index. Any subsequent command in this repository will use/update patch index whenever appropriate. You can skip using/updating patch index for a single command by passing --no-patch-index.
  2. patch index is disabled: You can create a repository in this state by passing --no-patch-index at creation. A lazy get also creates the repository in this state. You can set this state manually by running optimize --no-patch-index. Subsequent commands will not use/update patch index.
  3. undefined state: If a darcs with patch index support is run on a repository created by a darcs without it, patch index is created and the repository moves to state 1. If you want state 2 instead, run optimize --no-patch-index. If you want to run a single command without patch index and without setting a default, pass --no-patch-index.
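 The three states and their transitions can be summarised in a short sketch. This is a Python model of the rules listed above, not darcs code; PIState, initial_state, after_optimize and run_command are hypothetical names.

```python
from enum import Enum


class PIState(Enum):
    ENABLED = 1    # state 1: patch index usage is the default
    DISABLED = 2   # state 2: patch index is disabled
    UNDEFINED = 3  # state 3: repository made by a darcs without patch index


def initial_state(no_patch_index=False, lazy=False):
    """State of a freshly created repository (init or get)."""
    return PIState.DISABLED if (no_patch_index or lazy) else PIState.ENABLED


def after_optimize(enable):
    """optimize --patch-index / --no-patch-index sets the state explicitly."""
    return PIState.ENABLED if enable else PIState.DISABLED


def run_command(state, no_patch_index=False):
    """Return (uses_index, new_state) for one command run."""
    if state is PIState.DISABLED:
        return False, PIState.DISABLED
    if no_patch_index:
        # One-off opt-out: the stored default is left untouched.
        return False, state
    # ENABLED stays enabled; UNDEFINED creates the index and becomes ENABLED.
    return True, PIState.ENABLED
```

 For example, run_command(PIState.UNDEFINED) reports that the index is used and moves the repository to state 1, while passing no_patch_index=True leaves it undefined.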
Failing Tests
 There were 5 tests that failed in patch index darcs:
  • repair-corrupt.sh
  • repair-corrupt-add.sh
  • lazy-optimize-reorder.sh
  • issue1645-ignore-symlinks.sh
  • issue1645-ignore-symlinks-case-fold.sh
The tests failed for two reasons:
  • These scripts pass --patch, which can now match either --patches or --patch-index. Script writers, take note! You may have to update your scripts if they use a prefix of --patch-index.
  • The scripts have an internal setup logic that fails due to patch index changes: For example repair-corrupt, repair-corrupt-add fail because a corrupt record fails due to the extra code in patch index. Similarly lazy-optimize-reorder fails because some critical patches get downloaded in a lazy get due to the original repo having patch index. The solution is to pass --no-patch-index at the creation of the repo.
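 The first failure is the usual hazard of abbreviated long options. The sketch below shows prefix resolution in Python; it is a generic illustration rather than the option parser darcs uses, resolve_option is a hypothetical name, and the option lists contain only names mentioned in this post.

```python
def resolve_option(given, known):
    """Resolve a possibly abbreviated long option against the known ones.

    An exact match wins; otherwise the prefix must identify exactly one
    option, or the abbreviation is rejected.
    """
    if given in known:
        return given
    candidates = sorted(name for name in known if name.startswith(given))
    if len(candidates) == 1:
        return candidates[0]
    if not candidates:
        raise ValueError(f"unknown option {given}")
    raise ValueError(
        f"ambiguous option {given}: could be {', '.join(candidates)}"
    )


before = ["--patches", "--lazy"]
after = before + ["--patch-index", "--no-patch-index"]

print(resolve_option("--patch", before))  # resolves to the only candidate
try:
    resolve_option("--patch", after)
except ValueError as error:
    print(error)  # the same abbreviation is now ambiguous
```

 Spelling the option out in full avoids the problem, because an exact match is always accepted.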
Next week I will:
  1. Redo timing tests: In my last meeting with Benedikt, he pointed out that my timings assume a "hot" cache (i.e., the patch files are in the disk cache of the operating system). With a "cold" cache (i.e., after resetting the disk cache of the operating system), the time taken to run the commands increases dramatically. For example, for annotate on README in the darcs development repo:
    • hot pi: 0.3 sec
    • hot nopi: 3 sec
    • cold pi: 4 sec
    • cold nopi: 20 sec
    More distressingly, on a cold cache it takes 1 minute to build patch index. Hence it is ideal to build patch index immediately with get, where it takes around 6 sec.
  2. Tests on patch index integrity: There are not yet any tests on the patch index data structures.
  3. Integrate timing tests with darcs-benchmark
  4. UI discussion in darcs-users mailing list.
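 The annotate timings quoted in item 1 can be turned into speed-up factors and a rough break-even point for the cost of building the index. The Python below is back-of-the-envelope arithmetic on those numbers only; speedup and break_even are hypothetical helpers, and it assumes every later command saves the same amount of time.

```python
import math

# Seconds, as quoted in item 1 (annotate on README of the darcs repo).
ANNOTATE = {
    "hot": {"pi": 0.3, "nopi": 3.0},
    "cold": {"pi": 4.0, "nopi": 20.0},
}
BUILD = {"with get": 6.0, "cold cache": 60.0}


def speedup(times):
    """How many times faster annotate runs with patch index."""
    return times["nopi"] / times["pi"]


def break_even(build_seconds, times):
    """Number of annotate runs after which building the index pays off."""
    saved_per_run = times["nopi"] - times["pi"]
    return math.ceil(build_seconds / saved_per_run)


for cache, times in ANNOTATE.items():
    print(f"{cache}: {speedup(times):.1f}x")
print(break_even(BUILD["cold cache"], ANNOTATE["cold"]))
print(break_even(BUILD["with get"], ANNOTATE["hot"]))
```

 On these figures the speed-up is about 10x with a hot cache and 5x with a cold one, and a 60 second cold build pays for itself after 4 cold annotate runs.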
I am not able to push to darcsden public repository. Find my patch index repository mirror here.

Sunday, July 8, 2012

Gsoc: Patch Index Week 7

This week, I have worked on testing annotate on directories, and refactoring code according to the suggestions of Benedikt and Ganesh.

Annotate on directories
 If a directory name is given as input, annotate will output the last modifying patch for each file in the directory. However, the annotate in screened often fails and gives "unknown" as the corresponding patch for a file. A simple test case:
I have corrected this bug, and built patch index annotate on top of this. I have tested patch index annotate with the corrected version, and I found that they give the same output for all folders in darcs development repository.
 However, when I compared the corrected annotate with the screened annotate, I found that they gave different output for two files in /tests of the darcs development repository. (There were no differences in the other folders.)
 The files are ./tests/issue1446.sh, and ./tests/issue1248.sh. However, even though they give different patches, they both give the correct answer as per the documentation. This is because the filenames correspond to different files in the repository history. The corrected version gives the last patch modifying the most recent file, but the screened version gives the last patch modifying a previous file.   
 Eric suggested that I patch the corrected version of annotate, and see if it gets accepted into the development branch. I have done so.

Code Refactors
 Benedikt suggested some refactors, most of which bring the code closer to existing coding conventions. Ganesh suggested a refactor to help with the safety of the coerce used in the patch index code.

Next week, I will work on:
  • More flexibility in the default option: Patch index should be ideally invisible to the user. Currently, it is invisible only if the user uses patch index. If he wishes to not use patch index, he has to pass --no-patch-index to nearly every command. After a discussion with Eric, and Benedikt, the solution I have in mind is this:
    • Change the patch index automatic update, so that it works only if the patch index is created. (and does not create it automatically) This means that by default, patch index will be used/updated only if it exists.
    • Give an option to delete patch index, so that the user can go back to a state where the default is to not use/update patch index.
  • Disabling patch index on lazy get: patch index gets created by default at get. However, it requires all patches to be present in the repository. --lazy is used, so that patches will not be downloaded until needed. Thus by using patch index, the purpose of lazy is defeated, as it will get all the patches.
  • Integrate the timing tests into darcs-benchmark.

Sunday, July 1, 2012

Gsoc: Patch Index Week 6

 This week, I have made user interface changes, some code refactors, solved bugs, and expanded tests.

User Interface Changes
 A darcs command may create(init, get), update(record, amend-record, push, pull, apply, unrecord, tag) or use patch index(changes, annotate). darcs will automatically create, update, or use patch index whenever applicable. To override this, use flag --no-patch-index.
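 The mapping from commands to patch index actions can be written down as a small table. The Python below only restates the lists in the paragraph above; patch_index_action is a hypothetical helper, not a function in darcs.

```python
# Command groups exactly as listed in the paragraph above.
CREATES = {"init", "get"}
UPDATES = {"record", "amend-record", "push", "pull", "apply", "unrecord", "tag"}
USES = {"changes", "annotate"}


def patch_index_action(command, no_patch_index=False):
    """Return 'create', 'update', 'use', or None for a darcs command name.

    Passing no_patch_index=True models the --no-patch-index flag, which
    overrides the automatic behaviour.
    """
    if no_patch_index:
        return None
    if command in CREATES:
        return "create"
    if command in UPDATES:
        return "update"
    if command in USES:
        return "use"
    return None
```

 Commands outside the three groups, and any command run with --no-patch-index, leave patch index alone.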

Code Refactors
 I had duplicated some darcs annotate functions, and made minor modifications so that they do not use witnesses. Ganesh and I had a discussion on how to modify my code, so that I can use the original functions instead. I have implemented the suggestions, and removed the duplicate code.
 I have merged patch index annotate code into the original annotate. This removes code duplication between the two annotates. (changes was merged previously)
 I have made other minor refactors, like removing compiler warnings, and unnecessary exports.

Bug-fixes
   I have solved two bugs this week. The first concerns changes on directories. I can now confirm that changes on directories works for all directories in the darcs screened repository.
 The second bug-fix is for the function that checks if patch index is up to date. It used to crash if patch index was not yet created. This used to indirectly crash the function that loads patch index.

Tests
 I have written shell scripts that measure the time taken by pi annotate, changes and compare it with the existing annotate and changes. Find the result spreadsheets here: Changes Files, Changes Directories, Annotate Files.
 For now, you can expect a speed up of  6.7x for changes on files, 3.5x for changes on directories, 8.1x for annotate on files on average.
 I have written a test for patch index load when it does not exist,  and updated the test on creating and updating patch index.

Next week, I will get suggestions on the necessary refactors for integrating the code with screened(mid-term goal), and making sure annotate on directories works properly.

Sunday, June 24, 2012

Gsoc: Patch Index Week 5

 This week I worked on support for directories in patch index changes, testing the correlation between patch index and existing commands, and some minor refactoring.

Implementation of directory support in changes
 Patch index changes uses patch index to filter out the unrelated patches of the repository. A patch that does not modify any selected file gets filtered out. A file will be selected if it had a path that is a subpath of the given directory at any point in the repository history.
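 The selection rule can be sketched as below. This is a simplified Python model, not the patch index implementation: path_history (file id to every path the file has had) and touched (patch to the file ids it modifies) are hypothetical stand-ins for the real structures.

```python
from pathlib import PurePosixPath


def is_under(path, directory):
    """True if path equals directory or lies somewhere below it."""
    path, directory = PurePosixPath(path), PurePosixPath(directory)
    return path == directory or directory in path.parents


def select_files(path_history, directory):
    """File ids that were under `directory` at any point in history."""
    return {
        file_id
        for file_id, paths in path_history.items()
        if any(is_under(p, directory) for p in paths)
    }


def filter_patches(patches, touched, path_history, directory):
    """Keep only patches that modify at least one selected file."""
    selected = select_files(path_history, directory)
    return [p for p in patches if touched[p] & selected]


# f1 was moved from dir1/a to dir2/a; f2 has always been dir1/b.
path_history = {"f1": {"dir1/a", "dir2/a"}, "f2": {"dir1/b"}}
touched = {"p1": {"f1"}, "p2": {"f2"}, "p3": {"f1", "f2"}}
print(filter_patches(["p1", "p2", "p3"], touched, path_history, "dir2"))
```

 Here p2 is filtered out for dir2 because the only file it touches was never under that directory, while p1 and p3 are kept.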

Tests on correlation:
I have tested for correlation on changes with files using the below script. I ran the script on darcs screened repository, and found that it succeeded on all files. A similar script for annotate has also succeeded.
Changes on directories fails to give the same output, with a simple example given below: In the above example, the patch index version fails to give the middle two patches. This is so because the file a was never under dir2 at any point in repository history. An implementation which checks if the file was in the given directory's history will solve this issue.
  However, I found that the patch index version was as slow(and sometimes slower) as the current version for src of screened. I suggest that we either:

  •  Use patch index only on small directories, and fix the above bug
  •  Change implementation of patch index, so that it directly stores the relevant patches of a directory
Testing annotate on directories proved to be difficult. This is because patch-index annotate can give all the output of current annotate and still fail. Annotate on directories could fail for a particular file or sub directory, giving an unknown as a corresponding patch. Somehow, patch index annotate may succeed for a file even when current annotate fails. The current output format makes it difficult to filter out the "wrong" cases.

I have also tested the correlation when various options are passed. I have yet to find a case where a patch-index command gives different output from the current command (when applied to a file).

For the next week, I plan to get the code up to mid term inspection. My mentors gave me the following criteria:

  1. that it does what is intended
  2. that it supports all of the options
  3. that the code itself is of push-to-mainline quality
  4. and there are adequate tests
I am to give priority to the first goal of the project (automatic update of patch index). For now, this means
  • implementing --patch-index and --no-patch-index for all relevant commands
  • fixing a bug where patch index was assumed to exist when loading patch index
  • some refactors
  • more tests
I am now using the rebased repo created by Eric. Find my fork here.