Bug 24411

Summary: Error: CUDA kernel compilation failed
Product: Mageia Reporter: Richard Walker <richard.j.walker>
Component: RPM PackagesAssignee: All Packagers <pkg-bugs>
Status: RESOLVED FIXED QA Contact:
Severity: major    
Priority: Normal CC: geiger.david68210, ghibomgx, marja11, mitya, tmb
Version: Cauldron   
Target Milestone: ---   
Hardware: All   
OS: Linux   
See Also: https://bugs.mageia.org/show_bug.cgi?id=24379
Whiteboard:
Source RPM: nvidia-cuda-toolkit-9.1.85-2.mga7.nonfree CVE:
Status comment:
Attachments: nvidia cuda 10 spec file
patch for nvidia cuda 10 spec file
nvidia cuda 9.2 spec file
nvidia-cuda 10.1 spec file
failed render from cauldron Blender
successful render with cuda_10.1.105_418.39_linux
successful render with cuda_10.1.105_418.39_linux and Blender 2.80 b1
blender spec file with conditional flag for building offline cuda kernels
blender spec file with conditional flag for building offline cuda kernels
nvidia cuda 10.1 spec file
blender spec file with conditional flag for building offline cuda kernels
Console output from blender build (422 lines)
Console output from home directory build of CUDA sample programs
Console output from CUDA toolkit sample build of Vulkan example
modified findvulkan.mk for mageia support
nvidia cuda 10.1 spec file
patch1 for nvidia cuda 10.1 spec file
patch2 for nvidia 10.1 cuda spec file
patch3 for nvidia cuda 10.1 spec file
patch4 for nvidia cuda 10.1 spec file
nvidia cuda 10.1 samples binaries spec file
patch4 for nvidia cuda 10.1 spec file
Console output running 2_Graphics sample programs
script for downloading current blender 2.7 git
blender spec file with conditional flag for building offline cuda kernels
rpmbuild error building blender 2.79 git
Console output with verbose nvcc
blender spec file with conditional flag for building offline cuda kernels
Console output - failed build
borked xorg.conf
It looks like it should be working
new debug xorg.conf with no EDID
results from new xorg.conf

Description Richard Walker 2019-02-23 16:37:53 CET
Description of problem:
The problem was observed during use of the Blender program when executing a GPU-assisted Cycles render. The report in the console was:
CUDA version 9.1 detected, build may succeed but only CUDA 8.0 is officially supported.
Compiling CUDA kernel ...
"nvcc" -arch=sm_61 --cubin "/usr/share/blender/2.79/scripts/addons/cycles/source/kernel/kernels/cuda/kernel.cu" -o "/home/richard/.cache/cycles/kernels/cycles_kernel_sm61_7541DDBE6B1A613331389550DF3BCB6B.cubin" -m64 --ptxas-options="-v" --use_fast_math -DNVCC -I"/usr/share/blender/2.79/scripts/addons/cycles/source" 
In file included from /usr/include/host_config.h:50,
                 from /usr/include/cuda_runtime.h:78,
                 from <command-line>:
/usr/include/crt/host_config.h:121:2: error: #error -- unsupported GNU version! gcc versions later than 6 are not supported!
 #error -- unsupported GNU version! gcc versions later than 6 are not supported!
  ^~~~~
CUDA kernel compilation failed, see console for details.

Investigation shows that the failure to compile the CUDA kernel may be due to the /usr/include/crt/host_config.h file being from some time in 2017. It is possible that using headers and libraries from a more recent version of CUDA may fix the problem. 
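The gate behind this error is a plain preprocessor check in CUDA's host_config.h. As a rough shell sketch of the situation (the version limits are taken only from the error messages quoted in this report, not from any official compatibility table):

```shell
# Sketch of the version gate in CUDA's host_config.h, using the limits seen
# in this report: CUDA 9.1 rejects gcc > 6, CUDA 10.0 rejects gcc > 7 and
# CUDA 10.1 rejects gcc > 8. Illustrative mapping only.
max_gcc_for_cuda() {
    case "$1" in
        9.1)  echo 6 ;;
        10.0) echo 7 ;;
        10.1) echo 8 ;;
        *)    echo unknown ;;
    esac
}

# check <cuda_version> <gcc_major>: mimic the #error gate in host_config.h.
check() {
    max=$(max_gcc_for_cuda "$1")
    if [ "$max" != unknown ] && [ "$2" -le "$max" ]; then
        echo "gcc $2 OK for CUDA $1"
    else
        echo "gcc $2 unsupported by CUDA $1"
    fi
}

check 9.1 8    # Cauldron's gcc 8 against the packaged CUDA 9.1
check 10.1 8   # the same compiler against CUDA 10.1
```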

Version-Release number of selected component (if applicable):


How reproducible:
Every time

Steps to Reproduce:
1. Update all Cauldron packages
2. Install and run Blender
3. Configure Blender to use an Nvidia GPU, requiring nvidia-current drivers, for Cycles rendering
4. Render the default scene
Comment 1 Richard Walker 2019-02-23 16:42:38 CET
Text below copied from Comment 8 in the history of bug 24379

It may solve the problem if our CUDA packages were updated to CUDA 10.0.130 but that might prevent users, with nvidia hardware requiring pre-410.48 nvidia drivers, from getting CUDA accelerated rendering in Blender.

Perhaps we should consider driver-versioned rpms for the CUDA toolkit stuff too....

Meanwhile I can show that copying the Blender.org-supplied cycles addon from blender-2.79-667033e89e7f-linux-glibc224-x86_64/2.79/scripts/addons/cycles to ~/.config/blender/2.79/scripts/addons/cycles does indeed allow the Cauldron rpm version to start and render using my nvidia gpu.
Comment 2 Marja Van Waes 2019-02-24 11:14:03 CET
(In reply to Richard Walker from comment #1)
> Text below copied from Comment 8 in the history of bug 24379

Thanks for that, I had forgotten about that bug, so at first I didn't understand why you didn't file this bug against blender ;-)
> 
> It may solve the problem if our CUDA packages were updated to CUDA 10.0.130
> but that might prevent users, with nvidia hardware requiring pre-410.48
> nvidia drivers, from getting CUDA accelerated rendering in Blender.
> 
> Perhaps we should consider driver-versioned rpms for the CUDA toolkit stuff
> too....
> 
> Meanwhile I can show that copying the Blender.org-supplied cycles addon from
> blender-2.79-667033e89e7f-linux-glibc224-x86_64/2.79/scripts/addons/cycles
> to ~/.config/blender/2.79/scripts/addons/cycles does indeed allow the
> Cauldron rpm version to start and render using my nvidia gpu.

Assigning to all packagers collectively, since there is no registered maintainer for this package.

Also CC'ing daviddavid and some committers.

See Also: (none) => https://bugs.mageia.org/show_bug.cgi?id=24379
Source RPM: nvidia-cuda-toolkit-9.1.85-2.mga7.nonfree.src.rpm => nvidia-cuda-toolkit-9.1.85-2.mga7.nonfree
Assignee: bugsquad => pkg-bugs
CC: (none) => geiger.david68210, ghibomgx, marja11, mitya, tmb

Comment 3 Giuseppe Ghibò 2019-02-26 17:35:38 CET
I had packaged cuda 10.0.130 some weeks ago, but in the end the cuda compiler didn't work at all, dunno why (the CPU stays at 100% forever, even when compiling a basic cuda program). An alternative attempt could be to try a release between 10.0.130 and 9.1.85, which is 9.2.148 plus its PatchLevel 1, available here:

https://developer.nvidia.com/cuda-92-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Fedora&target_version=27&target_type=runfilelocal

Maybe it would also make sense to have versioned cuda toolkits (i.e. cuda9, cuda10, etc.), i.e. the whole tree installed in /opt/cuda-<version>/ (to which CUDADIR usually points), plus a second package with alternatives softlinks.
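As a sketch of how such alternatives-based versioning could work (everything here is hypothetical; no such Mageia packages exist yet): give each release its own /opt/cuda-<version> prefix and register it with update-alternatives under a priority derived from the version, so newer toolkits win by default. The helper below only prints the registration command rather than running it:

```shell
# Hypothetical helper: print the update-alternatives registration for a
# CUDA toolkit installed under /opt/cuda-<version>. The priority scheme
# (major * 100 + minor) is an assumption for illustration.
cuda_alt_cmd() {
    version="$1"
    prefix="/opt/cuda-${version}"
    priority=$(echo "$version" | awk -F. '{ print $1 * 100 + $2 }')
    echo "update-alternatives --install /opt/cuda cuda ${prefix} ${priority}"
}

cuda_alt_cmd 10.1
cuda_alt_cmd 9.2
```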
Comment 4 Richard Walker 2019-02-26 20:02:15 CET
Giuseppe, I would be happy to try your 10.0.130 if it is still available. My hardware setup is likely quite different from yours.

AMD A10-7860 APU drives all screens with amdgpu and radeonsi drm.
Nvidia 1050TI is headless and provides only CUDA for Blender.

Would that help?
Comment 5 Giuseppe Ghibò 2019-02-26 21:23:11 CET
Created attachment 10794 [details]
nvidia cuda 10 spec file
Comment 6 Giuseppe Ghibò 2019-02-26 21:23:53 CET
Created attachment 10795 [details]
patch for nvidia cuda 10 spec file
Comment 7 Giuseppe Ghibò 2019-02-26 21:24:26 CET
Created attachment 10796 [details]
nvidia cuda 9.2 spec file
Comment 8 Giuseppe Ghibò 2019-02-26 21:38:36 CET
I attached the spec files for cuda 10 and 9.2.148; the spec file for release 9.2.148 still needs to merge Patch1 from upstream, which I missed. The resulting src.rpm for 10.0 is nearly 1.9GB, and for 9.2.148 it is 1.6GB, and I'm a little short of upload bandwidth to put the .src.rpm somewhere. But the packages can easily be built from the spec files above, after downloading the nvidia binaries from
https://developer.nvidia.com/compute/cuda/10.0/Prod/local_installers/cuda_10.0.130_410.48_linux
Comment 9 Richard Walker 2019-02-26 21:47:48 CET
OK, downloading from nvidia now. I'll see if I can make sense of the spec files while I am waiting:~)

Thanks
Comment 10 Richard Walker 2019-02-26 22:58:43 CET
Sorry Giuseppe, I hit a snag:

/usr/bin/install: cannot stat '/home/richard/rpmbuild/SOURCES/nvvp.desktop': No such file or directory
error: Bad exit status from /home/richard/rpmbuild/tmp/rpm-tmp.0sHbbZ (%install)


These are mentioned in the spec file:

Source2:	nvidia
Source10:	nvvp.desktop
Source11:	nsight.desktop


I don't think I have any of those. What are they?

Sorry if there is an obvious answer but I am not too literate in the rpmbuild skills.
Comment 11 Giuseppe Ghibò 2019-02-26 23:12:37 CET
Those files are unchanged from the previous release; you can pick them up from the previous src.rpm package or here: http://svnweb.mageia.org/packages/cauldron/nvidia-cuda-toolkit/current/SOURCES/
Comment 12 Richard Walker 2019-02-26 23:46:57 CET
OK, rebuilding now. It could take a while, but I think I still see a problem in /usr/include/crt/host_config.h at line 127:

#if __GNUC__ > 7

#error -- unsupported GNU version! gcc versions later than 7 are not supported!

#endif /* __GNUC__ > 7 */


What do you think? Is it worth hand-hacking that to pass our gcc and see what happens?
Comment 13 Richard Walker 2019-02-27 00:24:48 CET
There's a relevant discussion of the gcc 7 -v- gcc 8 problem at
https://stackoverflow.com/questions/53344283/gcc-versions-later-than-7-are-not-supported-by-cuda-10-qt-error-in-arch-linux/53828864#53828864

It looks like the safest thing to do is use gcc 7!

We have gcc 5.5 in MGA6 so I am guessing that it was a Cauldron change from 7 to 8 which provoked the CUDA kernel build failure in Blender.

Did we have gcc 7 in Cauldron a few weeks ago? Can we get it back? They say it is possible to install multiversions of gcc and use the 'update alternatives' mechanism to establish a default compiler. 

The trick then would be to arrange somehow that CUDA always gets gcc 7 for working with nvcc.

I don't see this problem going away unless Nvidia does something about it. We cannot distribute a fully functioning Blender if the user cannot build the required CUDA kernel. Blender.org distributes a number of pre-built CUDA kernels with its binary downloads so we could always advise Blender users to get one of those to have a working CUDA Cycles renderer, but seriously, that would be an awful solution. 

On balance, bringing back gcc 7 seems the least bad answer.
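An alternative to switching the system default compiler: nvcc's -ccbin option selects the host compiler per invocation, so a gcc 7 packaged alongside gcc 8 would be enough (the /usr/bin/gcc-7 path below is an assumption about how such a package might name it). A hypothetical wrapper sketch, with overridable variables so it can be exercised without a CUDA install:

```shell
# Hypothetical wrapper: run nvcc with gcc 7 as its host compiler via
# -ccbin, leaving the system gcc 8 as the default for everything else.
# NVCC and HOST_CC are overridable so the sketch can be tried with stubs.
nvcc_gcc7() {
    "${NVCC:-nvcc}" -ccbin "${HOST_CC:-/usr/bin/gcc-7}" "$@"
}

# Demonstration with a stub driver (echo just prints the arguments):
NVCC=echo nvcc_gcc7 -arch=sm_61 --cubin kernel.cu
```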
Comment 14 Richard Walker 2019-02-27 01:05:55 CET
I have installed my build of your CUDA 10 spec file. 

I tested it first by rendering a model for which I had a timing from last night. The model rendered successfully within a second of the time taken using CUDA 9.1. The program was yesterday's Blender 2.80 beta1.

My second test was a simple render of the default cube in the current Mageia Cauldron Blender 2.79 (git, so not 2.79b) which I have "fixed" by copying the Cycles addon (includes CUDA pre-built kernels) to my user addons directory from a Blender.org 2.79 nightly build. This rendered correctly.

Next I removed my Cycles addon to create the situation a normal Mageia Blender user would encounter and re-tested the default cube render with the default Cauldron Blender. I got the previously noted "CUDA kernel compilation failed" error.

Finally I changed the gcc version test in /usr/include/crt/host_config.h so that our gcc 8 would pass and repeated the previous test. This time the render failed with these errors:

CUDA version 10.0 detected, build may succeed but only CUDA 8.0 is officially supported.
Compiling CUDA kernel ...
"nvcc" -arch=sm_61 --cubin "/usr/share/blender/2.79/scripts/addons/cycles/source/kernel/kernels/cuda/kernel.cu" -o "/home/richard/.cache/cycles/kernels/cycles_kernel_sm61_7541DDBE6B1A613331389550DF3BCB6B.cubin" -m64 --ptxas-options="-v" --use_fast_math -DNVCC -I"/usr/share/blender/2.79/scripts/addons/cycles/source" 
/usr/include/c++/8.3.0/type_traits(1049): error: type name is not allowed

/usr/include/c++/8.3.0/type_traits(1049): error: type name is not allowed

/usr/include/c++/8.3.0/type_traits(1049): error: identifier "__is_assignable" is undefined

3 errors detected in the compilation of "/tmp/tmpxft_00005922_00000000-6_kernel.cpp1.ii".
CUDA kernel compilation failed, see console for details.


So it looks like gcc 8.3.0 really will not work. It must be 7.x.x. However, the Blender.org CUDA kernels get you over the first hump and the limited rendering I have done with the CUDA 10 toolkit is successful.
Comment 15 Giuseppe Ghibò 2019-02-27 12:31:55 CET
As for gcc, currently there is gcc 8.3.0, and I think it will be the final system compiler version in cauldron/mga7. Consider that gcc 8.x was introduced in cauldron quite a while ago (24 Jul 2018).

As you found, faking the gcc version support didn't work. However, I noticed that cuda 10.1.105 has just come out. The format of its internal tree has changed a bit, so the nvidia-cuda-toolkit.spec will need to be reworked, but its host_config.h contains:

#if __GNUC__ > 8

#error -- unsupported GNU version! gcc versions later than 8 are not supported!

#endif /* __GNUC__ > 8 */

so gcc 8.x should at least be officially supported in that cuda version.
Comment 16 Giuseppe Ghibò 2019-02-27 18:06:12 CET
Created attachment 10801 [details]
nvidia-cuda 10.1 spec file
Comment 17 Giuseppe Ghibò 2019-02-27 18:17:30 CET
Here is the spec file for version 10.1. The runfile can be downloaded from:

https://developer.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda_10.1.105_418.39_linux.run

the spec file still needs some work; e.g. the -nsight subpackage provides a set of bundled libQt5*.so* libraries that interfere with the Provides of the system ones and must be placed in a requires_exclude list. As for further enhancements: man pages could be exported to %{_mandir}, and cublas and the other cuda libraries could be libified, so as to keep one major version (e.g. 9, 10, etc.).
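For the bundled Qt5 libraries, rpm's automatic dependency-generator filters are the usual tool. A sketch of what the exclusion might look like (the regex is an assumption and would need checking against the sonames the -nsight subpackage actually ships):

```spec
# Keep rpmbuild's automatic dependency generator from emitting
# Provides/Requires for the bundled Qt5 libraries, so they cannot
# shadow or conflict with the system Qt5 packages.
%global __provides_exclude ^libQt5.*$
%global __requires_exclude ^libQt5.*$
```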
Comment 18 Richard Walker 2019-02-27 19:40:31 CET
(In reply to Giuseppe Ghibò from comment #17)

It sounds like you have a plan. I am envious of your skill with this monster. I will be checking only the cuda toolkit and devel packages for now as my focus is really just on getting Blender to work a bit faster, and maybe later exploring OpenCL assistance with other programs. 

I'll put on another pot of coffee and get cracking on this rpm build. Thanks again...
Comment 19 Richard Walker 2019-02-27 22:01:05 CET
The build completed OK and I have installed the 10.1.105 CUDA toolkit and devel rpms. The first test was to run the Mageia Cauldron Blender and render the default cube. The CUDA kernel build process completed without error in a couple of minutes and the cube rendered correctly.

All other renders I have tried have frozen before completion, usually with all but 3 tiles rendered - regardless of the total number of render tiles. I suspect the new CUDA kernel build but I have no evidence yet. This could take a few hours I'm afraid.

Richard
Comment 20 Richard Walker 2019-02-28 01:23:30 CET
Created attachment 10804 [details]
failed render from cauldron Blender
Comment 21 Richard Walker 2019-02-28 01:40:20 CET
I think I am more confused than ever, but I have a couple of tentative conclusions based on a number of test renders.

I have tried to render a number of models in a variety of Blender versions. In particular I have used a version of Blender equivalent (from the same day) to the version currently in Cauldron. 

I also removed the pre-compiled CUDA kernels from yesterday's nightly build of Blender 2.80 beta1 and tested that too.

The results with cuda_10.1.105_418.39_linux :

Using Cauldron's Blender 2.79 git build from 22 Feb (I think) the kernel build was completed in about 2 minutes. The model render failed to complete. Only 9 of the 12 render tiles were finished and the render engine froze. See pic 2019-02-27 23-32-40-blender2.79git-mga7.png above.

Using the Blender.org equivalent build from Feb 22 I deleted the pre-compiled CUDA kernels in my blender-2.79-3b86c99260bc-linux-glibc224-x86_64/2.79/scripts/addons/cycles/lib directory and ran Blender, loaded the model and started a render. The missing kernels provoked a new kernel build which completed in the usual 120 seconds or so but the render took as long as a CPU-only render; about six and a half minutes. The new kernel had been put in the "wrong" place for a Blender.org build so I transferred the kernel and filter files to blender-2.79-3b86c99260bc-linux-glibc224-x86_64/2.79/scripts/addons/cycles/lib and hit F12 (render) again. This render completed normally in about the right time for a GPU Compute render; a little under four minutes. See pic 2019-02-28 00-37-07-blender.org2.79-3b86c99260bc.png below.
Comment 22 Richard Walker 2019-02-28 01:42:25 CET
Created attachment 10805 [details]
successful render with cuda_10.1.105_418.39_linux
Comment 23 Richard Walker 2019-02-28 02:00:16 CET
Doing the same delete/rebuild/install dodge for yesterday's nightly Blender 2.80 build also succeeded in using the CUDA 10.1 kernel. See pic below.

With the new CUDA kernel appearing to work in Blender.org builds which have been hacked to use freshly compiled CUDA kernels, and the Cauldron rpm of Blender crashing with the new kernel, I have no idea what is going on.

At first I thought it might be that the Blender CUDA code really is sensitive to the CUDA version (8 versus 10.1), but getting it to work in all 2.79 and 2.80 Blender.org builds which I have tried certainly puts that theory in some doubt. 

I will need to do some more work on the Cauldron version of Blender to determine if there may yet be a problem with how we prepare its rpm for release. 

In the meantime I would tentatively vote in favour of this experiment with CUDA 10.1
Comment 24 Richard Walker 2019-02-28 02:01:43 CET
Created attachment 10806 [details]
successful render with cuda_10.1.105_418.39_linux and Blender 2.80 b1
Comment 25 Giuseppe Ghibò 2019-02-28 17:48:05 CET
Does cuda-z compile for you? (Package sources can be taken from http://svnweb.mageia.org/packages/cauldron/cuda-z/current/.)

Furthermore, IIRC, blender can (re)compile fresh cuda kernel cubins offline, e.g. by adding this flag at the cmake configuration stage:

-DWITH_CYCLES_CUDA_BINARIES:BOOL=ON

should do the job, providing cubins for all the cuda architectures from 3.0 to 7.5. Alternatively with:

-DCYCLES_CUDA_BINARIES_ARCH:STRING=sm_61

you can specify a single architecture (in this case sm_61 for the GTX1050Ti).
Comment 26 Richard Walker 2019-02-28 20:46:15 CET
(In reply to Giuseppe Ghibò from comment #25)


There is no sign of the actual cuda-z source; cuda-z-0.11.273.tar.xz
I tried looking for the cuda-z source rpm but unless I am doing something really stupid, I can't find that either.

I tried following the comment in the spec file and I do now have a copy of the source from subversion, but it is revision 291, not 273. Would that do?  

The spec file also mentions a few other patch files which don't appear to be in http://svnweb.mageia.org/packages/cauldron/cuda-z/current/SOURCES/

Again, I am sure there is a very simple answer, but my knowledge doesn't stretch that far :~(
Comment 27 Richard Walker 2019-02-28 21:09:24 CET
OK, I found the source archive in cuda-z-0.11.273-1.mga6.nonfree.src.rpm

I still need to find:

Patch1:	    cuda-z-0.11.273-fix-host-defines-include.patch
Patch2:	    cuda-z-0.11.273-path-and-verbose.patch
Patch3:	    cuda-z-0.11.273-add-extra-arch.patch
Comment 28 Giuseppe Ghibò 2019-02-28 22:55:57 CET
Retry here: 

http://svnweb.mageia.org/packages/cauldron/cuda-z/current/SOURCES/
Comment 29 Richard Walker 2019-02-28 23:19:51 CET
Got it. Building now...
Comment 30 Richard Walker 2019-02-28 23:27:33 CET
It is running the command:

#$ cicc --c++14 --gnu_version=80300 --allow_managed   -arch compute_70 -m64 -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 --include_file_name "tmpxft_0000198b_00000000-2_cudainfo.fatbin.c" -tused -nvvmir-library "/bin/../lib64/nvvm/libdevice/libdevice.10.bc" --gen_module_id_file --module_id_file_name "/tmp/tmpxft_0000198b_00000000-3_cudainfo.module_id" --orig_src_file_name "src/cudainfo.cu" --gen_c_file_name "/tmp/tmpxft_0000198b_00000000-5_cudainfo.compute_70.cudafe1.c" --stub_file_name "/tmp/tmpxft_0000198b_00000000-5_cudainfo.compute_70.cudafe1.stub.c" --gen_device_file_name "/tmp/tmpxft_0000198b_00000000-5_cudainfo.compute_70.cudafe1.gpu"  "/tmp/tmpxft_0000198b_00000000-16_cudainfo.compute_70.cpp1.ii" -o "/tmp/tmpxft_0000198b_00000000-5_cudainfo.compute_70.ptx"


... and it is using 100% of one core and 6G (and rising) of my RAM. Run time is 8 minutes so far for cicc.
Comment 31 Richard Walker 2019-03-01 00:05:33 CET
OK, that's 44 minutes and it has filled RAM (14G) and moved into swap. The build has definitely failed.

I'll try that blender rpm rebuild now with the option to build all CUDA kernels.
Comment 32 Richard Walker 2019-03-01 01:26:37 CET
The new blender rpm is in place, but I think I have more work to do on it. It may well have built the CUDA kernels but there is still no sign of them. My guess is that I will have to do a bit more work on the spec to include these new CUDA kernels in the output, but first I have to find out where Blender expects them to be, and what they are called. 

It is late now, Giuseppe, I'll get back to this on Friday evening.

Richard
Comment 33 Giuseppe Ghibò 2019-03-01 17:16:08 CET
The cuda kernels are installed by the blender make install script and placed into:

/usr/share/blender/<blender_release>/scripts/addons/cycles/lib/kernel_sm_<cuda_capability>.cubin

and

/usr/share/blender/<blender_release>/scripts/addons/cycles/lib/filter_sm_<cuda_capability>.cubin

I'll attach a modified version of the blender .spec file that allows building the cuda cubins offline; just build with bm (build manager) or rpmbuild from the spec file, adding the parameter "--with cuda_cubins", or set the parameter:

%define build_cuda_cubins 1

in the .spec file.
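For reference, the usual rpm idiom behind such a "--with" switch looks like the sketch below (the attached spec may wire it slightly differently, and sm_61 is just an example architecture):

```spec
# "rpmbuild --with cuda_cubins" turns this on; it defaults to off.
%bcond_with cuda_cubins

# ...later, among the cmake configuration flags:
%if %{with cuda_cubins}
	-DWITH_CYCLES_CUDA_BINARIES:BOOL=ON \
	-DCYCLES_CUDA_BINARIES_ARCH:STRING=sm_61 \
%endif
```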
Comment 34 Giuseppe Ghibò 2019-03-01 17:17:14 CET
Created attachment 10811 [details]
blender spec file with conditional flag for building offline cuda kernels
Comment 35 Giuseppe Ghibò 2019-03-01 17:19:43 CET
Created attachment 10812 [details]
blender spec file with conditional flag for building offline cuda kernels

Attachment 10811 is obsolete: 0 => 1

Comment 36 Giuseppe Ghibò 2019-03-01 17:30:05 CET
Created attachment 10813 [details]
nvidia cuda 10.1 spec file

I'm adding a more polished version of the cuda 10.1 spec file, in which I excluded the libQt libraries so as to avoid dependency problems. I think this will be the version used to update the current nvidia-cuda-toolkit package to release 10.1.

I think the cuda code compilation works even if you don't have an nvidia card installed (just install the package and the nvidia drivers).

To test nvcc compilation, try this method: install the nvidia-cuda-toolkit-samples package, then copy the CUDA examples to any writable directory under your $HOME, for instance using:

cp -pr /usr/share/nvidia-cuda-toolkit/samples .

then go into samples, and compile everything with make.

cd ./samples
make

Here, it worked flawlessly.

Attachment 10801 is obsolete: 0 => 1

Comment 37 Giuseppe Ghibò 2019-03-01 17:33:37 CET
Created attachment 10814 [details]
blender spec file with conditional flag for building offline cuda kernels

Attachment 10812 is obsolete: 0 => 1

Comment 38 Giuseppe Ghibò 2019-03-01 17:56:26 CET
As for cuda-z, I wonder whether it's a problem of lack of memory (can anyone with 32GB or more of RAM test it?), a memory leak, or some other kind of bug in cuda or the cuda-z sources, since with previous cuda toolkit versions it completed without needing much RAM.
Comment 39 Richard Walker 2019-03-01 18:15:20 CET
Preparing to:

1. rebuild nvidia-cuda-toolkit with the revised spec file
2. test by compiling nvidia-cuda-toolkit-samples
3. rebuild blender-2.79b-14.git20190219.1.mga7.src rpm
4. test CUDA kernel building and Cycles render on GPU
5. rebuild cuda-z rpm against new cuda toolkit.

This will take a while...and a lot of coffee...

The cuda-z thing was strange. The makefile produced a reasonable amount of screen output until it got to that first invocation of nvcc at about line 124 (I'll check that later). Then it just continued to thrash one core and bleed memory. I have 16G ram with 16G swap and run a fairly lightweight LXDE so I started with 13G free (and 2G shared with video). 

I feel very much an observer using rpmbuild to do all the work. I do have the svn copy which is revision 291 so I can try building that the old fashioned way and see if I can get more information about what may be going wrong.
Comment 40 Richard Walker 2019-03-01 19:55:02 CET
Steps 1 & 2 complete without incident - yay!

moving on as planned
Comment 41 Giuseppe Ghibò 2019-03-01 20:49:38 CET
I forgot to mention that every directory in "samples" contains a number of cuda executables that can be launched and should produce some output (these require all the other nvidia stuff to be correctly initialized and working).
Comment 42 Richard Walker 2019-03-01 21:24:59 CET
While waiting for the blender re-build I have tried a few random sample programs. So far they have all worked or failed for reasons not related to CUDA (can't find libGL? It didn't try very hard!).
Comment 43 Richard Walker 2019-03-01 22:11:38 CET
Created attachment 10815 [details]
Console output from blender build (422 lines)

Giuseppe,
The blender rebuild, using your modified .spec file, has completed, but without building the kernel_sm_ files. I have a file containing the build console output. It is 9 MBytes, so I searched it for the "kernel_sm_" string, without success.

I have attached the first 400 lines or so of this console record. It covers all the rpm stuff up to launching the make. I think it shows how the build was configured. Let me know if you need to see the whole 31000 lines :~)
Comment 44 Richard Walker 2019-03-01 22:27:44 CET
This could be the problem:


at line 304 of attachment 10815 [details]
 
CUDA_TOOLKIT_ROOT_DIR not found or specified
-- Could NOT find CUDA (missing: CUDA_TOOLKIT_ROOT_DIR) (found version "10.1")
-- CUDA compiler not found, disabling WITH_CYCLES_CUDA_BINARIES

Looks like I need a few more directives or environment variables or constants to fix this.
Comment 45 Richard Walker 2019-03-01 22:41:00 CET
I am trying again with this CUDA section in your .spec file:

%if %{build_cuda_cubins}
	-DWITH_CYCLES_CUDA_BINARIES:BOOL=ON \
	-DCYCLES_CUDA_BINARIES_ARCH:STRING="sm_30;sm_32;sm_35;sm_37;sm_50;sm_52;sm_53;sm_60;sm_61;sm_62;sm_70;sm_72;sm_75" \
        -DCUDA_TOOLKIT_ROOT_DIR:STRING=%{_bindir} \
%endif
Comment 46 Giuseppe Ghibò 2019-03-01 22:59:50 CET
Pretty weird; it sounds like you haven't installed nvidia-cuda-toolkit and nvidia-cuda-toolkit-devel, or are mixing cuda toolkits from different sources. Here I got:

-- Found CUDA: /usr (found version "10.1") 
-- CUDA nvcc = /usr/bin/nvcc

which produced:

/usr/share/blender/2.79/scripts/addons/cycles/lib/filter_sm_30.cubin
/usr/share/blender/2.79/scripts/addons/cycles/lib/filter_sm_32.cubin
/usr/share/blender/2.79/scripts/addons/cycles/lib/filter_sm_35.cubin
/usr/share/blender/2.79/scripts/addons/cycles/lib/filter_sm_37.cubin
/usr/share/blender/2.79/scripts/addons/cycles/lib/filter_sm_50.cubin
/usr/share/blender/2.79/scripts/addons/cycles/lib/filter_sm_52.cubin
/usr/share/blender/2.79/scripts/addons/cycles/lib/filter_sm_53.cubin
/usr/share/blender/2.79/scripts/addons/cycles/lib/filter_sm_60.cubin
/usr/share/blender/2.79/scripts/addons/cycles/lib/filter_sm_61.cubin
/usr/share/blender/2.79/scripts/addons/cycles/lib/filter_sm_62.cubin
/usr/share/blender/2.79/scripts/addons/cycles/lib/filter_sm_70.cubin
/usr/share/blender/2.79/scripts/addons/cycles/lib/filter_sm_72.cubin
/usr/share/blender/2.79/scripts/addons/cycles/lib/filter_sm_75.cubin
/usr/share/blender/2.79/scripts/addons/cycles/lib/kernel_sm_30.cubin
/usr/share/blender/2.79/scripts/addons/cycles/lib/kernel_sm_32.cubin
/usr/share/blender/2.79/scripts/addons/cycles/lib/kernel_sm_35.cubin
/usr/share/blender/2.79/scripts/addons/cycles/lib/kernel_sm_37.cubin
/usr/share/blender/2.79/scripts/addons/cycles/lib/kernel_sm_50.cubin
/usr/share/blender/2.79/scripts/addons/cycles/lib/kernel_sm_52.cubin
/usr/share/blender/2.79/scripts/addons/cycles/lib/kernel_sm_53.cubin
/usr/share/blender/2.79/scripts/addons/cycles/lib/kernel_sm_60.cubin
/usr/share/blender/2.79/scripts/addons/cycles/lib/kernel_sm_61.cubin
/usr/share/blender/2.79/scripts/addons/cycles/lib/kernel_sm_62.cubin
/usr/share/blender/2.79/scripts/addons/cycles/lib/kernel_sm_70.cubin
/usr/share/blender/2.79/scripts/addons/cycles/lib/kernel_sm_72.cubin
/usr/share/blender/2.79/scripts/addons/cycles/lib/kernel_sm_75.cubin
Comment 47 Richard Walker 2019-03-01 23:08:44 CET
Right then, I'll uninstall the three cuda toolkit rpms and then go hunting for remnants. Then I will re-boot, re-install nvidia-cuda-toolkit, -devel and -samples. Then I will re-build blender.

But first I will wait to see what I get from my modified blender.spec file. If there really is a problem with my toolkit installation it may still fail to find what is so plainly present. It mostly worked for nvidia-toolkit-samples building, though there too it reported missing libGL and missing libGLU (? I think) files.
Comment 48 Richard Walker 2019-03-02 00:46:56 CET
Created attachment 10817 [details]
Console output from home directory build of CUDA sample programs

As we have built nvidia-cuda-toolkit-samples and we are using it, in part, to validate the cuda toolkit installation, I have attached the console output from that build.

Some libraries said to be missing are, in fact, present (e.g. libGL and lib64mesagl1-devel, libGLU and lib64mesaglu1-devel). Some of them (vulkan.h for instance) appear not to be available in MGA7 rpms. In general, these programs failed due to libraries not being found:

- libX11.so
- Vulkan SDK
- libvulkan.so
- vulkan.h
samples/2_Graphics/simpleVulkan

- libEGL.so
samples/3_Imaging/EGLStreams_CUDA_Interop
samples/3_Imaging/EGLSync_CUDAEvent_Interop
samples/3_Imaging/EGLStream_CUDA_CrossGPU

- libGL.so
- libGLU.so
samples/2_Graphics/volumeRender
samples/2_Graphics/volumeFiltering
samples/2_Graphics/Mandelbrot
samples/2_Graphics/bindlessTexture
samples/2_Graphics/marchingCubes
samples/2_Graphics/simpleGL
samples/2_Graphics/simpleTexture3D
samples/3_Imaging/imageDenoising
samples/3_Imaging/bilateralFilter
samples/3_Imaging/recursiveGaussian
samples/3_Imaging/bicubicTexture
samples/3_Imaging/simpleCUDA2GL
samples/3_Imaging/boxFilter
samples/3_Imaging/SobelFilter
samples/3_Imaging/postProcessGL
samples/5_Simulations/oceanFFT
samples/5_Simulations/smokeParticles
samples/5_Simulations/nbody
samples/5_Simulations/particles
samples/5_Simulations/fluidsGL
samples/6_Advanced/FunctionPointers
samples/7_CUDALibraries/randomFog

- libEGL.so
- libGLES.so
- libX11.so
samples/2_Graphics/simpleGLES_screen
samples/2_Graphics/simpleGLES
samples/2_Graphics/simpleGLES_EGLOutput
samples/5_Simulations/nbody_screen
samples/5_Simulations/fluidsGLES
samples/5_Simulations/nbody_opengles
Comment 49 Giuseppe Ghibò 2019-03-02 00:56:01 CET
vulkan.h should be in the vulkan-headers-1.1.92 package (btw, release 1.1.101 is out upstream); libvulkan.so is in lib64vulkan-loader-devel.
Comment 50 Giuseppe Ghibò 2019-03-02 01:18:10 CET
For the others that were missed, it seems the build looks in /usr/lib64/nvidia, while the nvidia stuff is installed in /usr/lib64/nvidia-current; using:

make GLPATH=/usr/lib64/nvidia-current

should compile (some of) the missed samples.
Comment 51 Richard Walker 2019-03-02 01:22:51 CET
Created attachment 10818 [details]
Console output from CUDA toolkit sample build of Vulkan example

(In reply to Richard Walker from comment #47)

I have finished my re-build of blender using my modified .spec file. Checking through the console output I seem to have at least 13 different kernel_sm_ files, so that looks good.

I have checked the installed files and all the /usr/share/blender/2.79/scripts/addons/cycles/lib files are present. I am waiting for it to reboot now as the linux kernel has just been updated...

Looking good so far. I will check operation of GPU Cycles rendering next but it is getting late. I may postpone the rebuild of cuda-z until tomorrow.


As for the vulkan stuff, I have installed:

lib64mesavulkan-drivers
lib64mesavulkan-devel
lib64vulkan-loader-devel
vulkan-headers

and re-run the makefile in samples/2_Graphics/simpleVulkan. I attach the result.
Comment 52 Richard Walker 2019-03-02 01:29:01 CET
(In reply to Giuseppe Ghibò from comment #50)

That's OK then, it won't build with mesa files, only nvidia. 

I do have /usr/lib(64)/nvidia symlinked to nvidia-current but of course I had to delete all of the OpenGL stuff from that location to prevent other programs from finding it and trying to use those libraries. 

You might (or might not) be surprised that there are lots of programs which refuse to link to /usr/lib64/libGL.so and will ferret out the nvidia version wherever you have it. I find deleting those files to be the easiest way to get headless nvidia GPU assistance while continuing to display via amdgpu and radeonsi.
Comment 53 Richard Walker 2019-03-02 01:53:32 CET
(In reply to Giuseppe Ghibò from comment #49)
I have checked and all the relevant headers and libraries appear to be present and correct. Maybe I need to do a completely fresh rebuild of the samples sources. It may be using previously detected configuration info, before I installed GLFW3 and the vulkan devel stuff.

It may have to wait until Saturday afternoon. I want to check out Blender operation now, and I am trying to debug a boot failure on one of my brother's machines and that's a hundred miles away. Isn't the internet a wonderful thing!

Goodnight for now

Richard
Comment 54 Giuseppe Ghibò 2019-03-02 01:54:15 CET
I think getting Vulkan working is more complicated than that, and requires tweaks specific to Mageia in the file simpleVulkan/findvulkan.mk; there is also another Vulkan SDK here: https://vulkan.lunarg.com/sdk/home#linux. Using the attached findvulkan.mk and installing the packages vulkan-devel, vulkan-headers, glslang and glfw-devel should at least get the example compiling; another trick to attempt would be to softlink /usr/lib64/nvidia to /usr/lib64/nvidia-current, but that probably won't work either.
Comment 55 Giuseppe Ghibò 2019-03-02 01:55:29 CET
Created attachment 10819 [details]
modified findvulkan.mk for mageia support
Comment 56 Richard Walker 2019-03-02 02:29:26 CET
I promise I will try that tomorrow - honest :~)

Right now Blender is still rendering all except three of however many render tiles the model image requires. I am going to try to build CUDA 8, eventually, just to rule out the possibility that this is a Blender problem and the Blender CUDA 8 code really doesn't like to be compiled with CUDA 10.x.

Otherwise the CUDA 10 toolkit appears to work very well, even with the complications I introduce by not having any nvidia OpenGL stuff, and not having the right flavour of vulkan (yet).

In other news, the cuda-z build has failed again, just like yesterday, so when I have got Blender working, found a usable vulkan, sorted out the nvidia GL problem and fixed my brother's boot issue (did I leave anything out?), I will see if a newer cuda-z will help.

Goodnight, really, I am on my way to bed...
Comment 57 Richard Walker 2019-03-02 15:21:40 CET
(In reply to Giuseppe Ghibò from comment #54)

You are so right. Getting the Vulkan example to work is way too complicated for me and beyond the scope of the current problem; first get a working CUDA toolkit with as many checks on its goodness as are provided by resources from Mageia Cauldron, then build Blender (as per modified .spec file) to verify correct operation of CUDA kernels in GPU-assisted Cycles render, and finally verify that cuda-z can be built correctly.

There are a few other anomalies it would be nice to fix along the way; headless operation of CUDA on a system using Mesa, a working Vulkan toolset and getting Nvidia OpenCL running while we are waiting for OpenCL support for AMD Sea Islands and later.

The current situation is that we have a partially validated CUDA 10.1 toolkit, locally built CUDA kernels in modified Cauldron Blender which fail during render, locally built CUDA kernels which appear to successfully replace the pre-built ones distributed with Blender.org nightly binary builds, and a cuda-z build which still hangs on first invocation of nvcc.

I'll be back on this in a couple of hours.
Comment 58 Giuseppe Ghibò 2019-03-02 15:40:33 CET
Created attachment 10822 [details]
nvidiacuda 10.1 spec file

Update of the current cuda 10.1 spec file

Attachment 10813 is obsolete: 0 => 1

Comment 59 Giuseppe Ghibò 2019-03-02 15:41:19 CET
Created attachment 10823 [details]
patch1 for nvidia cuda 10.1 spec file
Comment 60 Giuseppe Ghibò 2019-03-02 15:42:07 CET
Created attachment 10824 [details]
patch2 for nvidia 10.1 cuda spec file
Comment 61 Giuseppe Ghibò 2019-03-02 15:42:36 CET
Created attachment 10825 [details]
patch3 for nvidia cuda 10.1 spec file
Comment 62 Giuseppe Ghibò 2019-03-02 15:43:16 CET
Created attachment 10826 [details]
patch4 for nvidia cuda 10.1 spec file
Comment 63 Giuseppe Ghibò 2019-03-02 15:43:56 CET
Created attachment 10827 [details]
nvidia cuda 10.1 samples binaries spec file
Comment 64 Giuseppe Ghibò 2019-03-02 15:52:10 CET
For cuda-z, later. Passing debugging to nvcc (i.e. -G) would let the compilation pass, but the executable will probably be a lot slower, which means there is still something wrong, either in the code itself or in the optimizer. What is still not clear is whether there is a leak or just heavy memory use by the PTXAS optimizer.

For cuda toolkit 10.1, the evidence shows that this is the only release that could be shipped with mageia7/cauldron, as any older version, including 10.0, won't work with mageia7's gcc compiler.

I updated the cuda 10.1 spec file and I also provided a spec file for compiling the cuda 10.1 samples; this should compile all the cuda samples, including vulkan, and merge them all into a bin dir. The compilation should not require an nvidia card, but of course running those binaries to see whether they work does.
Comment 65 Giuseppe Ghibò 2019-03-02 15:54:20 CET
BTW, blender 2.79 compilation shows this warning:

CMake Warning at intern/cycles/kernel/CMakeLists.txt:349 (message):
  CUDA version 10.1 detected, build may succeed but only CUDA 9.0, 9.1 and
  10.0 are officially supported

so probably 10.0 works but they haven't yet checked the code.
Comment 66 Giuseppe Ghibò 2019-03-02 15:55:29 CET
s/10.0 works/10.1 works/
Comment 67 Richard Walker 2019-03-03 00:53:08 CET
Giuseppe,

I have built and installed:

nvidia-cuda-toolkit
nvidia-cuda-toolkit-devel
nvidia-cuda-toolkit-samples
nvidia-cuda-toolkit-samples-bins
blender

There appears to be little difference building and using the toolkit compared to yesterday. That's good, but not exciting.

The new samples binary package built and installed quietly enough, but it has produced a different number of programs in samples/bin/x86_64/linux/release (170 files) compared with yesterday's manual build in my home directory source tree where samples/bin/x86_64/linux/release has 148 files. Presumably the difference is due to your patches fixing installed library discovery and linkage.

The downside is that I think 170 is still not the expected total. At least one file wasn't built: simpleVulkan. Not a big surprise, perhaps, but simpleGLES, simpleGLES_EGLOutput, simpleGLES_screen and maybe others are also missing.

I have not yet checked the operation of all 170 sample programs but I have noted some general truths. Any sample program which has nothing to do with screen manipulation works correctly and passes any built-in tests.

Other sample programs are failing with a variety of errors. For instance:

[richard@Midnight6 release]$ /usr/share/nvidia-cuda-toolkit/samples/bin/x86_64/linux/release/Mandelbrot
[CUDA Mandelbrot/Julia Set] - Starting...
GPU Device 0: "GeForce GTX 1050 Ti" with compute capability 6.1

Data initialization done.
Initializing GLUT...
OpenGL window created.
Creating GL texture...
Texture created.
Creating PBO...
CUDA error at Mandelbrot.cpp:971 code=304(cudaErrorOperatingSystem) "cudaGraphicsGLRegisterBuffer(&cuda_pbo_resource, gl_PBO, cudaGraphicsMapFlagsWriteDiscard)" 

[richard@Midnight6 release]$ /usr/share/nvidia-cuda-toolkit/samples/bin/x86_64/linux/release/simpleGL
simpleGL (VBO) starting...

GPU Device 0: "GeForce GTX 1050 Ti" with compute capability 6.1

CUDA error at simpleGL.cu:422 code=304(cudaErrorOperatingSystem) "cudaGraphicsGLRegisterBuffer(vbo_res, *vbo, vbo_res_flags)" 
CUDA error at simpleGL.cu:434 code=400(cudaErrorInvalidResourceHandle) "cudaGraphicsUnregisterResource(vbo_res)" 


There was one other which complained about some GL extensions not being available, but I haven't found it again. Tomorrow I will go through them from 1 to 170 and record the failures.
Comment 68 Richard Walker 2019-03-03 01:05:41 CET
Moving on to Cauldron's Blender labelled 2.79b (actually a git snapshot from 19 Feb). There is no change in the manner in which GPU-assisted Blender Cycles renders fail: they still stop after rendering all but three of the required tiles.

Blender.org binary downloads continue to operate correctly when the supplied cycles/lib/ directory is replaced by the one we built with toolkit 10.1

As before, console logs are available for each of these builds, should you wish to see them. Toolkit build log is 510kB, samples-bin log is 399kB and Blender is 10.7MB
Comment 69 Giuseppe Ghibò 2019-03-03 01:34:08 CET
So none of the samples in 2_Graphics works?

As for blender, does the blender RPM built from the provided spec file with the generated cubins work? Or does it only work when those cubins are taken and merged into the latest blender.org binaries (of which date)?

What if upgrading the blender git version in the spec file to the current git?
Comment 70 Giuseppe Ghibò 2019-03-03 01:57:14 CET
Created attachment 10828 [details]
patch4 for nvidia cuda 10.1 spec file

Updated patch4 for vulkan/simpleVulkan.

Attachment 10826 is obsolete: 0 => 1

Comment 71 Giuseppe Ghibò 2019-03-03 02:02:52 CET
Does nvidia-smi show correct output during operations?
Comment 72 Richard Walker 2019-03-03 02:23:35 CET
Created attachment 10829 [details]
Console output running 2_Graphics sample programs

Thanks for the Vulkan update, I'll update the build accordingly. 
The Blender .spec file is the one with your added cubins stuff, plus my "-DCUDA_TOOLKIT_ROOT_DIR:STRING=%{_bindir}", and for my own convenience I changed "%define rel 2" 'cos I can get confused about which rpm is installed.

The resulting Blender rpm has the full set of CUDA kernels and filters in cycles/lib/.

To test a Blender.org nightly build I empty its cycles/lib/ directory which forces a CUDA kernel/filter build which ends up in ~/.cache/cycles/kernels with a name like cycles_[filter|kernel]_sm61_0D29129DD439CCE0DA8F8CD2A681C9A1.cubin. The big number is different for each version/build of blender.

The attached file is the console output from running each of the toolkit samples/2_graphics/ programs. They all either fail, or are not present!
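The testing procedure above can be sketched as a short shell snippet (the nightly's unpack path is a placeholder of mine, not from this report):

```shell
#!/bin/sh
# Force a blender.org nightly to rebuild its CUDA kernels with the local
# toolkit: empty cycles/lib, then run one GPU render. The unpack path
# below is hypothetical; adjust it to wherever the nightly was extracted.
BLENDER_DIR="$HOME/blender-2.79-nightly"
CYCLES_LIB="$BLENDER_DIR/2.79/scripts/addons/cycles/lib"
if [ -d "$CYCLES_LIB" ]; then
    rm -f "$CYCLES_LIB"/*.cubin
fi
# After the next GPU render the rebuilt kernels are cached per build under:
#   ~/.cache/cycles/kernels/cycles_kernel_sm61_<hash>.cubin
```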
Comment 73 Richard Walker 2019-03-03 02:39:08 CET
(In reply to Giuseppe Ghibò from comment #69)
"What if upgrading the blender git version in the spec file to the current git?"

That will probably work too, though I predict that the Cycles GPU renders will still stop 3 tiles short of completion.

I will fetch a snapshot as well as the equivalent binary download. Actually, I already have the latest download. It is dated 2019-03-01 00:41:28 and works fine when forced to build its own kernel/filter sm_61 using toolkit 10.1
Comment 74 Richard Walker 2019-03-03 02:45:27 CET
(In reply to Giuseppe Ghibò from comment #71)

I usually only run nvidia-smi when using Blender, and then not always - only if I need confirmation that I have CUDA set up correctly after an nvidia-current update. I usually have to force an initrd rebuild, don't ask me why, after every nvidia update and very occasionally after a kernel update too. 

I'll make sure I have it running on a spare screen if you like...
Comment 75 Richard Walker 2019-03-03 03:04:07 CET
I am hitting another snag. I don't know where to look to get a git snapshot. The current master is version 2.80 but we would want the latest 2.79 development version which produced the 1st March binary download.

My git skills are not up to that, in fact I don't have any :~[
Comment 76 Richard Walker 2019-03-03 03:50:08 CET
Finally for tonight, I removed the installed toolkit 10.1 rpms and it forced blender out as well. Not a welcome dependency, but hopefully it is only temporary.

I will rebuild with the new Vulkan patch on Sunday.
Comment 77 Giuseppe Ghibò 2019-03-03 13:55:41 CET
Created attachment 10831 [details]
script for downloading current blender 2.7 git

I add this script for downloading the current git 2.7 branch and producing the daily tarball. The script shows some errors like 'Previous HEAD position was d4100298 x3d import: make it work without internet connection', but those come from internal blender commands, so it should be ok. In this way it is possible to compare the same blender binaries from upstream with the current RPM build.

I also suspect that blender is very sensitive to cuda 10, as I see commits like this:
 
https://git.blender.org/gitweb/gitweb.cgi/blender.git/commit/f63da3dcf59f87b34aa916b2c65ce5a40a48fd92

which apparently builds using two different cuda sources in the same build tree, one for CUDA 10 and one for CUDA 9.
Comment 78 Giuseppe Ghibò 2019-03-03 14:00:41 CET
Created attachment 10832 [details]
blender spec file with conditional flag for building offline cuda kernels

Attachment 10814 is obsolete: 0 => 1

Comment 79 Richard Walker 2019-03-03 15:59:26 CET
(In reply to Giuseppe Ghibò from comment #70)

simpleVulkan is now generated correctly(?) and executes but...

Instance created successfully!!
WARNING: radv is not a conformant vulkan implementation, testing use only.
Selected physical device = 0x247ff00
Swapchain created!!
failed to open shader spv file!

It looks like the frag.spv and vert.spv files are generated by the makefile but then clobbered before the sample is installed. 

I was able to run /usr/bin/glslangValidator -H shader_sine.[frag,vert] to re-create these files and see what they contain. Everything looked OK so I re-ran the simpleVulkan example and got this:

[richard@Midnight6 release]$ ./simpleVulkan
Instance created successfully!!
WARNING: radv is not a conformant vulkan implementation, testing use only.
Selected physical device = 0x19ccf00
Swapchain created!!
Pipeline created successfully!!
CUDA error at vulkanCUDASinewave.cu:1510 code=1(cudaErrorInvalidValue) "cudaImportExternalMemory(&cudaExtMemVertexBuffer, &cudaExtMemHandleDesc)"
Comment 80 Richard Walker 2019-03-03 16:48:47 CET
Thanks Giuseppe, I have the .tgz in place with the new .spec and new knowledge; pigz can fly!

I didn't know about that one until your script told me I didn't have it :~)

I should have a report in an hour or so...
Comment 81 Richard Walker 2019-03-03 17:46:14 CET
Delay. Just had a crash - PCManFM, the LXDE file manager (and a bit more perhaps). It happens from time to time, but this time it caused a make error. That's the first time a file manager crash has affected a process other than the task bar.

Yes, it is a bug but no, I haven't filed it ... yet. It is very unpredictable and impossible, so far, to deliberately cause it to happen.

Meanwhile I have re-started rpmbuild. Hopefully it will be ok this time if I resist the temptation to go browsing through the Cycles addon sources.
Comment 82 Richard Walker 2019-03-03 20:17:40 CET
Created attachment 10833 [details]
rpmbuild error building blender 2.79 git

rpmbuild failed at 93% complete.

The file manager crash I referred to earlier must have been coincidental. This is the second rebuild after that first crash and all have stopped at the same place.

The attachment contains five files;

CMakeLists-freestyle is CMakeLists.txt from BUILD/blender-2.79b-git20190301/source/blender/freestyle. 
The last line is:
blender_add_lib(bf_freestyle "${SRC}" "${INC}" "${INC_SYS}")

Makefile2 is Makefile from BUILD/blender-2.79b-git20190301/build.

rpmbuild-full-output is the full 9MB console output from rpmbuild

rpmbuild-BUILD-ERROR is the last dozen or so lines from the full console output.

rpm-tmp.Ab7cEz is the file which rpmbuild was executing.
Comment 83 Richard Walker 2019-03-03 20:22:03 CET
I will be on the road until about 22:00GMT
Comment 84 Richard Walker 2019-03-04 00:09:03 CET
The good news is that the download from Blender.org dated 2019-03-01 can build and use its own cycles_filter_sm61_C266701B6DA7F04AEAABA3328AC151A4.cubin and cycles_kernel_sm61_C266701B6DA7F04AEAABA3328AC151A4.cubin.
Comment 85 Richard Walker 2019-03-04 01:11:57 CET
Created attachment 10835 [details]
Console output with verbose nvcc

Meanwhile I have re-run the cuda-z build with your .spec and patches.

It hangs with the results in the attached file.

Some of the files referenced are expected to be found in /tmp. I have added the relevant file list to the bottom of the console output.
Comment 86 Giuseppe Ghibò 2019-03-04 16:48:10 CET
Created attachment 10837 [details]
blender spec file with conditional flag for building offline cuda kernels

The blender build fails due to an upstream bug in the code; just rerun the script for downloading the current git as of today (20190304) and it will complete the compilation.

Attachment 10832 is obsolete: 0 => 1
Attachment 10833 is obsolete: 0 => 1

Comment 87 Richard Walker 2019-03-05 00:16:13 CET
Got there ahead of you. I have successfully built two versions of the March 4 git; one with and one without the CUDA kernels build.

With the kernels included in the rpm:
test render completes in 1min 5sec
With the kernels omitted and built on-demand
test render completes in 1min 5sec
Using Blender.org build of March 4, the included kernel
test render completes in 1min 13sec
Using Blender.org build of March 4, kernels omitted and built on-demand
test render completes in 1min 9sec

Locally built CUDA kernels in all cases perform slightly faster!
Comment 88 Richard Walker 2019-03-05 01:14:52 CET
Created attachment 10839 [details]
Console output - failed build

I tried rebuilding cuda-z without the extra architectures: sm_70, sm_72, sm_75.

It made little difference. cicc now hangs while working on sm_62 instead.
Comment 89 Richard Walker 2019-03-05 02:52:23 CET
Apologies for Comment 87 - you had the answer before I was even home from work! Your spec file is essentially the same as mine.

I have since completed testing the Blender.org 2.80 download with similar successful results.

With the supplied CUDA kernels the test render finished in 1min 8sec
With the CUDA 10.1 kernel built on demand the render time was 1min 05sec
With a prebuilt CUDA 10.1 kernel in place of those supplied : 1min 05sec

All tests of nvidia-cuda-toolkit-10.1.105-1.mga7.x86_64.rpm and its -devel- have now been completely successful with the March 4 Blender 2.79 built as blender-2.79b-14.git20190304.1.mga7.x86_64.rpm either with or without the inclusion of built CUDA kernels.

All tests of nvidia-cuda-toolkit-10.1.105 have also been successful in building and running the sample programs from nvidia-cuda-toolkit-samples-10.1.105-1.mga7.x86_64.rpm with your fixes for Vulkan and GL/EGL/GLES. Nevertheless there are still some problems at run time for all example programs in 2_Graphics and 3_Imaging and elsewhere.

This may be a problem associated with the Mesa implementation for AMD Kaveri which has been exposed by VirtualGL in particular (https://bugs.mageia.org/show_bug.cgi?id=23990 and https://groups.google.com/d/msg/virtualgl-users/orJUPt0a94o/OjrcvIy_AgAJ) and any program trying to use 24bit pbuffers in general. As such it might be beyond the scope of this bug.

I could set up a machine to test all of this using, say, nouveau. I would be nervous about doing this because I have always found it difficult to remember how to get CUDA working on a card which isn't being used for screen output.
Comment 90 Richard Walker 2019-03-10 16:08:32 CET
Giuseppe,
I have been trying to come up with a test environment which I can use to check the use of nvidia-cuda-toolkit-10.1.105 example programs in OpenGL environments other than nvidia-current and Mesa's amdgpu/radeonsi support.

I don't think I can do this with the hardware I have available:

AMD APUs, Nvidia GTX 960 and 1050 and an old 6200.

Furthermore, I have uncovered another bug which affects the current Cauldron version of Blender, and all releases of Blender since about 9 January 2019. I have reported this at https://developer.blender.org/T60379.

In the absence of any other test results I am happy to close this bug as solved for my specific combination of hardware and software;

AMD A10-7860 Kaveri screen drivers and Mesa
Nvidia GTX1050 GPU for CUDA only using 418.43-1.mga7.nonfree and nvidia-cuda-toolkit-10.1.105

Would you agree?
Comment 91 Giuseppe Ghibò 2019-03-11 15:14:25 CET
Let's wait a bit. I think the nvidia-cuda-toolkit.spec can soon be merged into the current cauldron svn, as it is already better than the current 9.1, which is not working anyway. Regarding the other tests that also involve OpenGL, I think nvidia-current should also be enabled as the device driver and the GL libraries switched to the nvidia proprietary ones.

/usr/sbin/update-alternatives --set gl_conf /etc/nvidia-current/ld.so.conf

should do the switch manually, but it also needs to be enabled/configured in the rest of the Xorg configuration to use the proprietary drivers.

As for bug https://developer.blender.org/T60379, it talks about the Win10 version and is also for the blender 2.80 series. Is that number right, or was it a typo?

As for the old nvidia 6200, I don't think it has CUDA support; the list of CUDA-capable GPUs is here:

https://developer.nvidia.com/cuda-gpus
Comment 92 Richard Walker 2019-03-11 19:38:52 CET
(In reply to Giuseppe Ghibò from comment #91)

I think that there may be a variety of issues with osd-3.3.3, but I am only sort of guessing. The Blender 2.80 report was, I think, a little misleading as it describes a way, which I can duplicate, to provoke a massive memory leak. You have to be quick to catch it and kill blender, but you get all your memory back.

At the moment we do not package 2.80 and it is beta 1 after all, maybe beta 2, but certainly still getting many bug fixes and improved existing features.

I suspect (again I am only guessing) that as part of the backporting of improvements from 2.80 to 2.79, we got something a little unexpected, some time in early January. It is interesting to see how the startup time of a January 6 Blender 2.79 is very very quick and the next one I have, around January 9, is noticeably slower.

The reason I am banging on about OSD is that it has never been included in our Blender build, but is part of the binary releases from Blender.org. When I rebuild our current Cauldron Blender with opensubdiv-3.3.3 it inherits the instability I recorded in the T60379 bug report. It is exactly the same behaviour that my copy of the nearest date Blender.org daily exhibits, and it is a hard crash.

When I load my test file, as included in the T60379 report, into our Cauldron Blender, with no OSD, then adaptive subdivision, classified as an "experimental" feature, simply does not work. Neither the simple torus, nor the vehicle roadwheel appear smoothly curved. They both display the underlying jaggedness of the simple, unmodified mesh geometry.

The vehicle model from which the roadwheel was copied was developed in Blender 2.79 and has not shown this viewport anomaly until I saw it in the Mageia Cauldron git rpm from February, and ever since.

I didn't see it in the ship model I was using for our CUDA render tests, but if I look closely I may find it in the wheelhouse :~(

As for testing with Nvidia GL, if I am careful about backing up critical files (and there are a few of those - getting amdgpu working properly was not trivial) I am sure I can disable the onboard graphics and set up the nvidia card to take at least one of my screens. That should do for testing, I reckon.

I will tackle that very soon, perhaps Wednesday. For now I am preparing two systems for migration to Cauldron. They are sound studio machines and I have a number of applications to rebuild for current Cauldron drivers, glibc and gcc.
Comment 93 Giuseppe Ghibò 2019-03-11 23:04:53 CET
You can use ulimit -Sv 8000000 before running blender to limit the amount of memory it can allocate without having it leak the whole system memory (8000000 means about 8GB).
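For example (a subshell keeps the limit from affecting the rest of the session; this assumes blender is on $PATH):

```shell
#!/bin/sh
# Cap blender at ~8 GB of virtual memory so a runaway leak kills blender
# rather than exhausting system memory. ulimit -Sv takes kilobytes.
if command -v blender >/dev/null 2>&1; then
    ( ulimit -Sv 8000000; exec blender )
fi
```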
Comment 94 Richard Walker 2019-03-13 22:56:10 CET
Created attachment 10869 [details]
borked xorg.conf

I have been struggling with the change to using the Nvidia card as the screen driver. 

In a little over two hours I have achieved some progress;

Backed up and replaced my grub/menu.lst file
Disabled on-board graphics in the BIOS
Removed, rebooted and configured the nvidia graphics via XFdrake
Rebuilt initrd to get rid of the amdgpu driver 

My Xorg.0.log tells me that my monitor is connected to DFP-1 and that DFP-0, DFP-2 and DFP-3 are disconnected. I can only see three sockets on the back of the card - DVI, HDMI and a flat looking one which might be Display Port(?). I only have an HDMI lead for this monitor, so the others don't matter.

I am pretty sure the desktop is starting and displaying on a port I can't use. My 2 hours+ struggling has been directed to trying to make the proper changes to my xorg.conf file to get the damn thing to put the picture where I can see it.

In case I am doing something really stupid which everyone else knows about, I have attached it here.
Comment 95 Richard Walker 2019-03-14 20:56:44 CET
Created attachment 10870 [details]
It looks like it should be working

I am close to the end of the road with this one. As far as I can tell from the log I should be looking at my MGA7 screen on the monitor it is attached to. 

The log tells me that the monitor has been detected on DFP-1, its resolution has been set correctly (1920x1080), and the monitor is used during the system boot, right up to the graphical login screen. It just goes black when I log in.

The only evidence I have that it doesn't work properly is the blackness of the screen.

The only thing I can think to do is buy a DVI cable and disconnect the HDMI.

Should I raise a bug for this too? The list is growing....
Comment 96 Giuseppe Ghibò 2019-03-14 21:47:40 CET
Maybe it's not able to query the EDID information correctly. What if you provide the monitor modeline manually, just for your Acer monitor, in the "Monitor" Section of xorg.conf, and add a:

Option         "UseEDID" "False"
Option         "ModeDebug" "True"

to the "Screen" Section? You can get some EDID informations with:

urpmi monitor-edid

and with:

monitor-get-edid | monitor-parse-edid

which should give the ModeLine info.
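Put together, the relevant xorg.conf fragments would look something like this (the ModeLine below is a generic 1080p60 timing shown only as a placeholder; use the one printed by monitor-parse-edid for the actual panel):

```
Section "Monitor"
    Identifier "Monitor0"
    # Placeholder 1920x1080@60 timing; replace with the ModeLine
    # reported by: monitor-get-edid | monitor-parse-edid
    ModeLine "1920x1080" 148.50 1920 2008 2052 2200 1080 1084 1089 1125 +hsync +vsync
EndSection

Section "Screen"
    Identifier "Screen0"
    Monitor    "Monitor0"
    Option     "UseEDID"   "False"
    Option     "ModeDebug" "True"
EndSection
```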
Comment 97 Richard Walker 2019-03-14 23:26:53 CET
Created attachment 10871 [details]
new debug xorg.conf with no EDID

Attachment 10869 is obsolete: 0 => 1

Comment 98 Richard Walker 2019-03-14 23:38:47 CET
Created attachment 10872 [details]
results from new xorg.conf

This took much too long - sorry.
To get the EDID from the 27" Acer I had to plug it into this machine and run your "monitor-get-edid | monitor-parse-edid" as it failed when run on the target PC tty2 or via ssh from this PC.

The result looks just like the mode line reported previously and the effect is exactly the same - no screen, despite the log indicating it all worked properly.

It seems this "black screen on nvidia HDMI 10x0 series" is not unique to me, but nobody seems to have a definitive answer. The "solutions" mostly involve either backing off to an earlier nvidia driver or rebuilding/reinstalling the current driver.

I think it just doesn't work. I'll get a DVI lead tomorrow - my brother can make good use of it when I have finished this nvidia-cuda-toolkit test (almost forgot what this was all about :~)

Attachment 10870 is obsolete: 0 => 1

Comment 99 Richard Walker 2019-03-16 01:58:36 CET
The DVI connection worked but I had no GL. I tried everything "proper" to fix it and in the end I had to hide the Mesa lib64/libGL.so and substitute symlinks to lib64/nvidia-current.

That worked for all tests so far; foobillard, glxspheres64 and Blender. Unfortunately Blender no longer finds the card for CUDA support!

I have been hacking at this for hours and I am now so far away from my original "default" configuration I begin to wonder if I will be able to get back to it when this toolkit testing is done.

I will take a longer closer look at the nvidia stuff to see if I can spot what has gone missing, but I'll get a proper night's sleep first.
Comment 100 Giuseppe Ghibò 2019-03-16 10:54:28 CET
Probably it is missing some of the "slave" softlinks that update-alternatives sets, or there is some "debris" from the previous configuration.

The problem is that this situation seems not that uncommon, but we remain in a vague position, as this can't yet be turned into an out-of-the-box "fixing" script (we don't know exactly which softlink is missing, or which nvidia*.ko file is missing or interferes with the installation). In bug https://bugs.mageia.org/show_bug.cgi?id=24436 there was recently, near the end of the bug, a procedure to restore the nvidia drivers to working order. Many people had success following it (the second problem is that once you have fixed it, you no longer know exactly what the culprit was, assuming it was a single one).

With the nvidia drivers working, CUDA should be properly available inside blender, and with the latest blender 2.7x git a checkbox should also appear in the CUDA menu that allows "hybrid" rendering, i.e. using both CUDA and the CPU at the same time to perform rendering.

Back to CUDA: don't forget there is also an init script, /etc/init.d/nvidia, that is run once and sets up the CUDA device nodes and permissions properly. Also, nvidia-smi has an option called "--persistence-mode" that can be used to get the CUDA stuff pre-initialized. This can be useful to avoid the small initialization delay when an nvidia card is used remotely as a rendering machine and doesn't have the X11 stuff that would preload it. I was evaluating whether such an extra command could be merged into the cuda toolkit's /etc/init.d/nvidia script, but that would require that every cuda card (think of multiple nvidia cards) is detected and --persistence-mode sent to each of them.
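The per-card loop being considered could be sketched like this (the init-script context is an assumption; the index extraction just parses `nvidia-smi -L` lines of the form `GPU 0: GeForce GTX 1050 Ti (UUID: ...)`):

```shell
#!/bin/sh
# Sketch: enable persistence mode on every detected nvidia GPU, as a
# candidate addition to the /etc/init.d/nvidia script discussed above.
if command -v nvidia-smi >/dev/null 2>&1; then
    for idx in $(nvidia-smi -L | sed -n 's/^GPU \([0-9]*\):.*/\1/p'); do
        nvidia-smi -i "$idx" --persistence-mode=1
    done
fi
```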
Comment 101 Richard Walker 2019-03-19 02:21:12 CET
I am back from my brother's place with a DVI lead and a working monitor on the Nvidia card. I have also discovered why CUDA seemed to stop working; I was missing some essential devel rpms which are now re-installed. A quick check with Blender tells me CUDA operation is restored and I can now return to the testing of the nvidia-cuda-toolkit package.

It would seem that preparing for the change to the nvidia screen driver by removing the installed packages caused a lot of other stuff to be removed, and I didn't notice.

I will continue on Tuesday evening
Comment 102 Richard Walker 2019-03-20 00:04:37 CET
I have finished re-testing the "samples" build from the nvidia-cuda-toolkit and I think I know now why there are 31 samples which fail to build.

The initial suspect was my mangled Nvidia OpenGL installation, and before that it was suspected that my Mesa OpenGL (for the AMD A10 APU) was not up to the task.

In fact it seems that Nvidia has included distribution-specific tests to find various required libraries and they all fail because Mageia isn't Red Hat or Fedora.

The tests are done by these files:
findgleslib.mk
findgllib.mk
findegl.mk

2_Graphics/simpleGLES_screen/findgleslib.mk
2_Graphics/volumeRender/findgllib.mk
2_Graphics/simpleGLES/findgleslib.mk
2_Graphics/volumeFiltering/findgllib.mk
2_Graphics/Mandelbrot/findgllib.mk
2_Graphics/bindlessTexture/findgllib.mk
2_Graphics/simpleGLES_EGLOutput/findgleslib.mk
2_Graphics/marchingCubes/findgllib.mk
2_Graphics/simpleGL/findgllib.mk
2_Graphics/simpleTexture3D/findgllib.mk
3_Imaging/imageDenoising/findgllib.mk
3_Imaging/EGLStreams_CUDA_Interop/findegl.mk
3_Imaging/bilateralFilter/findgllib.mk
3_Imaging/recursiveGaussian/findgllib.mk
3_Imaging/bicubicTexture/findgllib.mk
3_Imaging/simpleCUDA2GL/findgllib.mk
3_Imaging/EGLSync_CUDAEvent_Interop/findegl.mk
3_Imaging/boxFilter/findgllib.mk
3_Imaging/EGLStream_CUDA_CrossGPU/findegl.mk
3_Imaging/SobelFilter/findgllib.mk
3_Imaging/postProcessGL/findgllib.mk
5_Simulations/nbody_screen/findgleslib.mk
5_Simulations/fluidsGLES/findgleslib.mk
5_Simulations/oceanFFT/findgllib.mk
5_Simulations/smokeParticles/findgllib.mk
5_Simulations/nbody/findgllib.mk
5_Simulations/particles/findgllib.mk
5_Simulations/fluidsGL/findgllib.mk
5_Simulations/nbody_opengles/findgleslib.mk
6_Advanced/FunctionPointers/findgllib.mk
7_CUDALibraries/randomFog/findgllib.mk
Comment 103 Richard Walker 2019-03-20 02:37:11 CET
The results for 165 compiled sample programs:

5_Simulations/fluidsGL appears to just hang with a green textured screen. It might be the "right" result and ESC will quit.

Everything else either passed or failed as expected (e.g. only one GPU when two were required).

Looks like a good'un. I am switching back to amdgpu screen and Mesa now (or tomorrow).

I would recommend that the various findxxxx.mk files be patched to work with Mageia. I think you have already done the Vulkan one, or was that for something else?
Comment 104 Richard Walker 2019-03-20 02:41:50 CET
Thank you Giuseppe for all your hard work. In particular, thank you for the extra work you did on the blender spec and for the script to fetch and pack the Blender dailies from git. I will be putting it all to good use as I try to build OpenShadingLanguage and OpenSubDiv support.

Richard

Status: NEW => RESOLVED
Resolution: (none) => FIXED

Comment 105 Giuseppe Ghibò 2019-03-20 15:50:13 CET
The patches for the gles, egl and gl libs were already included in comments 59, 60, 61 and 62. With those, the cuda toolkit should be able to compile all 171 samples. In particular, the package nvidia-cuda-toolkit-samples-bins (release 3) in nonfree can be retrieved with those files already compiled. To test them all at once, just install the package nvidia-cuda-toolkit-samples-bins, then issue:

for i in /usr/share/nvidia-cuda-toolkit/samples/bin/x86_64/linux/release/*; do echo ${i} && ${i}; done

For the samples involving graphics, an interactive window should pop up.
Of course you can also compile them yourself from the nvidia-cuda-toolkit-samples package, by copying the samples dir to a writable dir and running "make -j1". The cuda compilation (and just that) doesn't even require an nvidia card to be installed.
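The out-of-tree build described above can be sketched as follows. The source path is an assumption inferred from the samples-bins layout mentioned earlier, and the commands are only printed (a dry run), since the package and an nvcc toolchain may not be present:

```shell
# Assumed install path of the nvidia-cuda-toolkit-samples package.
SAMPLES_SRC=/usr/share/nvidia-cuda-toolkit/samples
# Writable copy in the user's home directory.
BUILD_DIR=$HOME/cuda-samples

# Dry run: print the two steps (copy, then single-job make) rather
# than executing them.
cmds="cp -a $SAMPLES_SRC $BUILD_DIR
make -C $BUILD_DIR -j1"
printf '%s\n' "$cmds"
```

Running make with -j1 keeps the per-sample output readable, which helps when hunting down the distribution-specific findxxxx.mk failures discussed above.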
Comment 106 Richard Walker 2019-03-20 20:16:19 CET
I blame old age for my failing memory. Of course you had already patched the findxxxx.mk scripts. I had a moment of madness at the end of last week - accidentally deleted my rpmbuild directory thinking it was something else. It took an evening to fetch all of the sources, specs and patches again and then I had a weekend to get the nvidia screen and driver to work. 

I built the samples from what I thought was an up-to-date directory in my home but it was older than that. Silly me. I have the Cauldron toolkit updates in place now so I will retry the samples when I have put this system back in its normal working state; AMD screens and Nvidia for CUDA only.

I think you said earlier that the Nvidia CUDA toolkit does not require a working nvidia screen, so I will strip out all the nvidia rpms and see if I can find a way to do that. Then I will get back to the Blender tests.

Thank you again for all your help and guidance.