| Summary: | Migrate from Subversion to Git for packaging sources | ||
|---|---|---|---|
| Product: | Infrastructure | Reporter: | Neal Gompa <ngompa13> |
| Component: | Others | Assignee: | Augier <christophe> |
| Status: | NEW --- | QA Contact: | |
| Severity: | normal | ||
| Priority: | Normal | CC: | mageia, olav, sysadmin-bugs, thierry.vignaud |
| Version: | unspecified | ||
| Target Milestone: | Mageia 7 | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Source RPM: | CVE: | ||
| Status comment: | |||
|
Description
Neal Gompa
2017-02-25 17:59:03 CET
Neal Gompa
2017-02-25 17:59:43 CET
Target Milestone:
--- =>
Mageia 7

There's also a third step of porting Mageia's packagers tooling from SVN to git. Like `mgarepo`.

(In reply to Augier from comment #1)
> There's also a third step of porting Mageia's packagers tooling from SVN to
> git. Like `mgarepo`.

That's the part about "packager tooling". :)

> That's the part about "packager tooling". :)
Oh ! Well... Héhé...
*Quickly disappears*
By looking at the previous work, in particular `svn-git-migration`[1], I found an interesting resource explaining how to migrate an SVN repo to a git one[2]. It seems like `svn-git-migration` was used to migrate the software repos. I could totally use some of these bash scripts, but as this is not hard to develop, and as there is more work to do than this repo migration[3], I prefer to rewrite them using Python.

I was also pointed to a previous work named `svn2git`[4], which is based on KDE's work but is written in C++. Considering my level in C++ and the fact that this code seems to perform the migration by hand instead of using `git svn`, it does not seem wise for me to use this previous work.

[1]: http://gitweb.mageia.org/software/infrastructure/svn-git-migration/
[2]: http://john.albin.net/git/convert-subversion-to-git
[3]: In particular, there is the work to comply with the dist-git structure.
[4]: http://gitweb.mageia.org/software/infrastructure/svn2git/about/

Just to ensure the dist-git structure and its sources are understood, here's the template layout:

<pkg>
|- cauldron <- SVN packages/cauldron/<pkg>
|- mga6 <- SVN packages/updates/6/<pkg> (doesn't exist yet!)
|- mga6-backport <- SVN packages/backports/6/<pkg> (doesn't exist yet!)
|- mga6-infra <- SVN packages/updates/infra_6/<pkg> (doesn't exist yet!)
|- mga5 <- SVN packages/updates/5/<pkg>
|- mga5-backport <- SVN packages/backports/5/<pkg>
|- mga5-infra <- SVN packages/updates/infra_5/<pkg>
|- mga4 <- SVN packages/updates/4/<pkg>
|- mga4-backport <- SVN packages/backports/4/<pkg>
|- mga4-infra <- SVN packages/updates/infra_4/<pkg>
|- mga3 <- SVN packages/updates/3/<pkg>
|- mga3-backport <- SVN packages/backports/3/<pkg>
|- mga3-infra <- SVN packages/updates/infra_3/<pkg>
|- mga2 <- SVN packages/updates/2/<pkg>
|- mga2-infra <- SVN packages/updates/infra_2/<pkg>
|- mga1 <- SVN packages/updates/1/<pkg>
|- mga1-infra <- SVN packages/updates/infra_1/<pkg>
|- misc <- SVN packages/misc/<pkg>

Also of note, our dist-git system's fallback checksum will be sha1 rather than md5, so we can just rename sha1.lst to "sources" as we will have our tools Do The Right Thing(TM) here. Going forward, we may choose to move to sha512, as Fedora did.

Colin has already worked on git migration. You should ask him what he's done.

CC:
(none) =>
mageia, thierry.vignaud

Yeah I started this a few years back. The migration is a massive headfuck as it involves quite a lot of history. I think I didn't bother with the Mandriva history as I did with the tools merge, as that would just be a total nightmare.

Sadly the packages repo is a different beast to the software repos. The tools used to convert the software repos don't really scale to the size of the packages repo (the git-svn stuff is pretty much horribly inefficient).

Fortunately, the KDE guys wrote a tool to migrate their svn to git some time ago. I forked this and made some changes to make it work for our repos: http://gitweb.mageia.org/software/infrastructure/svn2git/

I left some notes here in my poorly written/parsed Markdown :D http://gitweb.mageia.org/software/infrastructure/svn2git/about/

I also spoke to Fedora folks about sha1 vs md5 vs sha512 a while back. I think we can/should switch to sha512. We're in the fortunate position that we've never removed anything from our binrepo. We could quite easily do the following:

1. Take the current binrepo and code in the ability to generate sha512 sums on upload and store both sha1 and sha512 (using hardlinks), but also store a map from sha1 -> sha512 in some form of DB (easier than finding matching hardlinks in a filesystem).
2. As part of the conversion above (perhaps as a final filter-branch), do a lookup of sha1 -> sha512, and thus "history will be rewritten" with sha512 sums instead.

That's one option. The other is just to slowly migrate after git conversion.

Either way, perhaps if further discussions are needed I can arrange to be around for interactive chats. I tend not to be on IRC much these days and my mail is often filtered so I don't look too often, but I'll try and keep this bug in mind for further chats.

FWIW, I have a (now rather outdated) faked version of our packages SVN repo to run tests on, i.e. it only contains a few packages and has some renames and branches etc. to test some of the corner cases.
I can supply this if it helps Neal's testing? Running the script on the real SVN repo is not something you want to do regularly when testing - ideally we'd only do it once!

Oh, for the avoidance of doubt, the "skipping revisions" feature I added was to tidy up the mistake made when all of cauldron in svn was accidentally svn rm'ed, then restored again in the next commit. That wouldn't look good in git if migrated!

(In reply to Augier from comment #4)
> By looking at the previous works, in particular `svn-git-migration`[1], I
> found an interresting resource explaining how to migrate an SVN repo to a
> git one[2]. It seems like `svn-git-migration` was used to migrate the
> software repos. I could totally use some of these bash scripts, but as this
> is not a hard to develop, and that there are more work than this repo
> migration[3], I prefer to rewrite them using Python.
>
> I was also pointed a previous work named `sv2git`[4] which is based on a
> KDE's work but is written in C++. Considering my level in C++ and the fact
> that this code seems to perform the migration by hand instead of using `git
> svn`, it does not seem wise for me to use this previous work.
>
> [1]: http://gitweb.mageia.org/software/infrastructure/svn-git-migration/
> [2]: http://john.albin.net/git/convert-subversion-to-git
> [3]: In particular, there is the work to comply to dist-git structure.
> [4]: http://gitweb.mageia.org/software/infrastructure/svn2git/about/

Just to reply to this specifically, I'd strongly suggest NOT using git-svn in any way for this migration. It is horribly inefficient and would likely take months of computation time to make even a dent in our packages repo. You really do have to go lower level and parse each revision one at a time and split it into packages, rather than taking one package and looking for its changes across all the revisions (this is the main difference between svn2git and git-svn).
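To illustrate the "one pass over the revisions, split each revision by package" idea described above, here is a minimal Python sketch. It is not how svn2git is implemented; it only shows the shape of the approach. The input format (a list of `(revision, changed_paths)` tuples), the function names `package_of` and `split_history`, and the `SPECS/` sub-paths in the usage note are assumptions made up for the example. Only the `packages/cauldron/<pkg>`, `packages/updates/<n>/<pkg>`, `packages/backports/<n>/<pkg>` and `packages/misc/<pkg>` prefixes come from the layout quoted in this bug.

```python
from collections import defaultdict


def package_of(path):
    """Return the package name a changed SVN path belongs to, or None.

    Assumes the path prefixes listed in this bug report; anything else
    (for example paths outside packages/) is ignored.
    """
    parts = path.strip("/").split("/")
    if len(parts) < 3 or parts[0] != "packages":
        return None
    if parts[1] in ("cauldron", "misc"):
        return parts[2]
    if parts[1] in ("updates", "backports") and len(parts) >= 4:
        return parts[3]
    return None


def split_history(revisions):
    """Walk every revision exactly once and fan it out per package.

    `revisions` is a hypothetical iterable of (revnum, [changed paths])
    in ascending revision order. The result maps each package name to the
    list of (revnum, [paths]) entries that touched it.
    """
    history = defaultdict(list)
    for revnum, paths in revisions:
        touched = defaultdict(list)
        for path in paths:
            pkg = package_of(path)
            if pkg is not None:
                touched[pkg].append(path)
        for pkg, pkg_paths in touched.items():
            history[pkg].append((revnum, pkg_paths))
    return dict(history)
```

With three made-up revisions, `split_history([(1, ["packages/cauldron/foo/SPECS/foo.spec", "packages/cauldron/bar/SPECS/bar.spec"]), (2, ["packages/updates/5/foo/SPECS/foo.spec"]), (3, ["unrelated/file"])])` yields entries for `foo` at revisions 1 and 2 and for `bar` at revision 1. The point is that the cost is one scan of the revision stream, not one scan per package.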
Yes it's written in C++ and I'm sure you could rewrite it in python, but I suspect that's not needed, nor really worth the effort. The tool is mostly working, it just doesn't handle all corner cases well. The main issue is it doesn't handle renames very well, e.g. when a package is renamed with an svn mv. I think adding support for this would be fairly straightforward. The other issue is that it doesn't handle copying specific revisions from svn and preserving history nicely, e.g. when resurrecting a package that was obsolete.

So all that's really needed is for someone to take the code and run with it with some fairly small changes/modifications. I suspect rewriting it in python would take considerably longer (and again, you cannot rely on shelling out to git-svn here - it's just not scalable - you pretty much have to take the same approach as the C++ code, but just reimplement it).

There are certain post-conversion tasks that certainly could be automated in python scripts, e.g. the filter-branch to do the final conversion to the dist-git layout on each repo and the importing of the final (static) changelog. And the final verification and comparison to the final svn state for each package (possibly done before the filter-branch so the layouts are easier to compare) could all be automated via a nice python wrapper (although, as with the conversion itself, particular care will have to be taken over scalability).

FWIW, I encoded the layouts for dist-git here: http://gitweb.mageia.org/software/infrastructure/svn2git/tree/rules/mga-pkgs.rules?h=distro/mga

This was before I learned of the term "dist-git" - I was just copying the general fedora layout :D The layout suggested in this bug report is similar but not identical. The rules could be updated and I have no strong opinion on which is best, but I would suggest we try to stick to the distsuffix we use in the rpms (e.g. mga5 and mga5.infra etc. - the only difference to the above is dots rather than hyphens).

FWIW I still think we want to have the commit messages as our changelogs (cf. fedora, which encodes them in the .spec). This requires some changes to our srpm generation, which currently uses svn log (and any legacy changelog). This will need to be changed to work with git log. We currently use svn revprop to "edit" incorrect svn log messages. This isn't possible with git. We can however use git notes, which could provide an alternative commit message for any given commit if an error was detected. The generation of the srpm changelog will therefore be a bit more involved, but still totally possible if switched to git log+notes.

Hope all this is useful.

(In reply to Neal Gompa from comment #5)
> Just to ensure the dist-git structure and its sources are understood, here's
> the template layout:
>
> <pkg>
> |- cauldron <- SVN packages/cauldron/<pkg>
> |- mga6 <- SVN packages/updates/6/<pkg> (doesn't exist yet!)
> |- mga6-backport <- SVN packages/backports/6/<pkg> (doesn't exist yet!)
> |- mga6-infra <- SVN packages/updates/infra_6/<pkg> (doesn't exist yet!)
> |- mga5 <- SVN packages/updates/5/<pkg>
> |- mga5-backport <- SVN packages/backports/5/<pkg>
> |- mga5-infra <- SVN packages/updates/infra_5/<pkg>
> |- mga4 <- SVN packages/updates/4/<pkg>
> |- mga4-backport <- SVN packages/backports/4/<pkg>
> |- mga4-infra <- SVN packages/updates/infra_4/<pkg>
> |- mga3 <- SVN packages/updates/3/<pkg>
> |- mga3-backport <- SVN packages/backports/3/<pkg>
> |- mga3-infra <- SVN packages/updates/infra_3/<pkg>
> |- mga2 <- SVN packages/updates/2/<pkg>
> |- mga2-infra <- SVN packages/updates/infra_2/<pkg>
> |- mga1 <- SVN packages/updates/1/<pkg>
> |- mga1-infra <- SVN packages/updates/infra_1/<pkg>

All fine IMO, but I would use dots in the git branch names to match the distsuffix used in the generated RPMs (very minor nitpick).

> |- misc <- SVN packages/misc/<pkg>

I don't think we should import this "branch" at all. It's only used to store static changelogs about each package AFAIK. These should IMO be incorporated as part of each package repo's final filtering, to inject just the latest version of this changelog file (as it generally doesn't change after initial SRPM import) into each and every branch we have in the git repo as a changelog.txt file. Then the SRPM generation for each can include it (combining it with git log+notes, as we do currently with svn log and as mentioned above).

> Yes it's written in C++ and I'm sure you could rewrite it in python, but I suspect that's not needed, nor really worth the effort. The tool is mostly working, it just doesn't handle all corner cases well. The main issue is it doesn't handle renames very well. e.g. when a package is renamed with an svn mv. I think adding support for this would be fairly straight forward. The other issue is that it doesn't handle copying specific revisions from svn and preserving history nicely - e.g. when resurrecting a package that was obsolete.
You really overestimate my C++ skills. It could take ages before I understand the code and how to use it.
Yes, git-svn takes a lot of time. But we perfectly can leverage the problem by parallelising the task, which is pretty straightforward using Python 3.6.
Furthermore, I know very few about SVN (which is why I'm *really* motivated to port the stuff to git).

(In reply to Colin Guthrie from comment #11)
> (In reply to Neal Gompa from comment #5)
> > Just to ensure the dist-git structure and its sources are understood, here's
> > the template layout:
> >
> > <pkg>
> > |- cauldron <- SVN packages/cauldron/<pkg>
> > |- mga6 <- SVN packages/updates/6/<pkg> (doesn't exist yet!)
> > |- mga6-backport <- SVN packages/backports/6/<pkg> (doesn't exist yet!)
> > |- mga6-infra <- SVN packages/updates/infra_6/<pkg> (doesn't exist yet!)
> > |- mga5 <- SVN packages/updates/5/<pkg>
> > |- mga5-backport <- SVN packages/backports/5/<pkg>
> > |- mga5-infra <- SVN packages/updates/infra_5/<pkg>
> > |- mga4 <- SVN packages/updates/4/<pkg>
> > |- mga4-backport <- SVN packages/backports/4/<pkg>
> > |- mga4-infra <- SVN packages/updates/infra_4/<pkg>
> > |- mga3 <- SVN packages/updates/3/<pkg>
> > |- mga3-backport <- SVN packages/backports/3/<pkg>
> > |- mga3-infra <- SVN packages/updates/infra_3/<pkg>
> > |- mga2 <- SVN packages/updates/2/<pkg>
> > |- mga2-infra <- SVN packages/updates/infra_2/<pkg>
> > |- mga1 <- SVN packages/updates/1/<pkg>
> > |- mga1-infra <- SVN packages/updates/infra_1/<pkg>
>
> All fine IMO, but I would use dots in the git branch names to match the
> distsuffix used in the generate RPMs (very minor nitpick)
>

I don't think backports and infra have distsuffixes that differ, so I don't think it matters from that point of view. If they *do have different disttags*, then I agree, and I'd go for that.

And as for core/tainted/nonfree, it doesn't really exist as a separate entity from the VCS point of view anyway. We'll probably wire up some fancy magic for double-submit for YOURI/Koji to handle this when certain conditionals exist in the spec file.

> > |- misc <- SVN packages/misc/<pkg>
>
> I don't think we should import this "branch" at all. It's only used to store
> static changelogs about each package AFIAK. These should IMO be incorporated
> as part of each package repo's final filtering to inject just the latest
> version of this changelog file (as it generally doesn't change after initial
> SRPM import) into each and every branch we have in the git repo as a
> changelog.txt file). Then the SRPM generation for each can include it
> (combining it with git log+notes as we do currently with svn log and as
> mentioned above).

That is certainly another approach. The main thing I want to avoid is unnecessary duplication of content, especially for something that's (generally) frozen on import. Though, it occurs to me that with how git checkouts work, it might be tricky to simultaneously check out the content of both for the required re-merge of changelogs at spec file rebuild time.

I'd also just call it <pkg>.rpmchangelog or something like that, just to keep it unique and obvious.

(In reply to Augier from comment #12)
> You really overestimate my C++ skills. It could take ages before I
> understand the code and how to use it.
> Yes, git-svn takes a lot of time. But we perfectly can leverage the problem
> by parallelising the task, which is pretty straightforward using Python 3.6.

That won't help as much as you think:
1) the SVN server is a slowness contention point
2) svn2git already runs fastimport processes concurrently

Lots of people who do know svn & git spent time on speeding up the conversion. I don't remember the details, nor can I find a link, but I do remember that svn2git made a _HUGE_ difference back in the days when KDE did the big migration to git.

So if you "know very few about SVN", I doubt you'll be able to beat them...

Please don't reinvent the wheel.
There's a working tool, let's just use it.

> So if you "know very few about SVN", I doubt you'll be able to beat them...
> Please don't reinvent the wheel.
> There's a working tool, let's just use it.
As far as I understand, the tool needs polishing. That's what's bothering me a bit.
@Thierry Vignaud: do you know where the rules language of svn2git is documented?

(In reply to Neal Gompa from comment #6)
> Also of note, our dist-git system's fallback checksum will be sha1 rather
> than md5, so we can just rename sha1.lst to "sources" as we will have our
> tools Do The Right Thing(TM) here. Going forward, we may choose to move to
> sha512, as Fedora did.

Recommend the following:
- Upgrade Python to 3.6, so you gain sha3 functions
- Use SHA3
- Store and verify file length together with the hash

SHA1 has been known to be insecure for a pretty long time. SHA2 is also weakened at this stage. See http://valerieaurora.org/hash.html

CC:
(none) =>
olav

(In reply to Augier from comment #16)
> > So if you "know very few about SVN", I doubt you'll be able to beat them...
> > Please don't reinvent the wheel.
> > There's a working tool, let's just use it.
>
> As far as I understand, the tool needs polishing. That's what bothering me a
> bit.

(In reply to Augier from comment #17)
> @Thierry Vignaud: do you know where the rules language of the svn2git is
> documented?

I can completely understand where you're coming from, but now that you know the svn repo is ~700Gb I think you're appreciating the scale of the issue.

FYI, I did also do a git-svn of our packages repo some time ago, but it only included the specs, not anything else. After several days of running it, I had to give up. Even once it had been converted, the performance of the git repo was itself horrendous. A git log took >30s to start, and that was with a reasonably fast machine and an SSD!

So all things told, git-svn is definitely not the right tool for the job. As I said, you could rewrite svn2git in python but I think that would be a pretty time-consuming task.

I won't be able to do much in the short term, but after mid-April I might be able to help polish the tool a bit in terms of C++ work, provided you're able to do the testing and verification stuff? To be honest, the code itself is quite simple and I'm sure even with moderate python skills you should be able to dip in and do some tweaks, even if larger tweaks are tricky.

I'll look out a partial svn repo dump for you that I used during my testing.

So apparently there's an actual "dist-git" project[1] that provides an implementation of the Dist-Git backend used in Fedora (both the main project and for COPR). Might be worth leveraging?

[1]: https://github.com/release-engineering/dist-git

Ok, I was able to test svn2git thanks to Colin's instructions. It seems to work quite well and it is awesomely fast. For those who want to see what it's like, I've made up a test repo. Just do the following:
$ git clone --recursive https://github.com/christophehenry/svn-2-dist-git.git
$ cd svn-2-dist-git/svn2git
$ qmake && make
$ cd ../test/mga-packages-git
$ ./test.sh
Next I'm going to dig into the tool a bit to generate git repos in the proper dist-git format. It should not be difficult, as Colin already wrote rules that are close to it.
BTW, Colin, can you tell me where that rules language is documented?
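For reference while adapting the rules, the SVN-path-to-branch mapping from the template layout in comment #5 can be written down as a small Python sketch. This is only an illustration of the mapping, not svn2git rules syntax. The function name `svn_path_to_branch` and the regular expressions are made up for the example, and the branch names use the hyphenated form from comment #5; whether to use dots instead, and whether "cauldron" should be "master", are still being discussed in this bug.

```python
import re

# (pattern, branch-name template) pairs, taken from the layout in comment #5.
# The infra rule is listed before the plain updates rule for readability;
# the two cannot both match because infra directories start with "infra_".
_RULES = [
    (re.compile(r"^packages/cauldron/(?P<pkg>[^/]+)"), "cauldron"),
    (re.compile(r"^packages/updates/infra_(?P<rel>\d+)/(?P<pkg>[^/]+)"), "mga{rel}-infra"),
    (re.compile(r"^packages/updates/(?P<rel>\d+)/(?P<pkg>[^/]+)"), "mga{rel}"),
    (re.compile(r"^packages/backports/(?P<rel>\d+)/(?P<pkg>[^/]+)"), "mga{rel}-backport"),
    (re.compile(r"^packages/misc/(?P<pkg>[^/]+)"), "misc"),
]


def svn_path_to_branch(path):
    """Map an SVN path to (package, git branch), or None if no rule matches."""
    path = path.lstrip("/")
    for pattern, template in _RULES:
        match = pattern.match(path)
        if match:
            # Unused named groups (e.g. pkg) are simply ignored by format().
            return match.group("pkg"), template.format(**match.groupdict())
    return None
```

For example, `svn_path_to_branch("packages/updates/infra_4/foo")` gives `("foo", "mga4-infra")`, and a path outside `packages/` gives `None`. Switching to the dotted distsuffix form would only mean changing the templates.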
(In reply to Colin Guthrie from comment #11)
> (In reply to Neal Gompa from comment #5)
> > Just to ensure the dist-git structure and its sources are understood, here's
> > the template layout:
> >
> > <pkg>
> > |- cauldron <- SVN packages/cauldron/<pkg>
> > |- mga6 <- SVN packages/updates/6/<pkg> (doesn't exist yet!)
> > |- mga6-backport <- SVN packages/backports/6/<pkg> (doesn't exist yet!)
> > |- mga6-infra <- SVN packages/updates/infra_6/<pkg> (doesn't exist yet!)
> > |- mga5 <- SVN packages/updates/5/<pkg>
> > |- mga5-backport <- SVN packages/backports/5/<pkg>
> > |- mga5-infra <- SVN packages/updates/infra_5/<pkg>
> > |- mga4 <- SVN packages/updates/4/<pkg>
> > |- mga4-backport <- SVN packages/backports/4/<pkg>
> > |- mga4-infra <- SVN packages/updates/infra_4/<pkg>
> > |- mga3 <- SVN packages/updates/3/<pkg>
> > |- mga3-backport <- SVN packages/backports/3/<pkg>
> > |- mga3-infra <- SVN packages/updates/infra_3/<pkg>
> > |- mga2 <- SVN packages/updates/2/<pkg>
> > |- mga2-infra <- SVN packages/updates/infra_2/<pkg>
> > |- mga1 <- SVN packages/updates/1/<pkg>
> > |- mga1-infra <- SVN packages/updates/infra_1/<pkg>

Just to reply to my own comment about this. I realised I missed a rather important detail here. This layout does not specify a master branch. It would be pretty strange for a git repo not to have such a branch. Lots of git tools assume this is the case (e.g. git itself on clone, cgit, and likely a lot of others - sure, they can be configured and you can manually work around this, but I think this just creates unnecessary work for ourselves).

I can't think of any reason to deviate from the fedora approach and would therefore strongly suggest that the "cauldron" branch above is actually just "master".

(In reply to Colin Guthrie from comment #22)
>
> I can't think of any reason to deviate from the fedora approach and would
> therefore strongly suggest that the "cauldron" branch above is actually just
> "master".

I'm okay with the default branch being called "master".

(In reply to Neal Gompa from comment #23)
> (In reply to Colin Guthrie from comment #22)
> >
> > I can't think of any reason to deviate from the fedora approach and would
> > therefore strongly suggest that the "cauldron" branch above is actually just
> > "master".
>
> I'm okay with the default branch being called "master".

Fedora is considering undergoing the process of migrating all "master" branches to "rawhide"[1]. I'd rather we be ahead of the game here and have our development branch be called "cauldron" for similar reasons.

[1]: https://pagure.io/fesco/issue/2410 |