SoFiA Production Update 2

Sam
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Help desk expert

Joined: 9 Feb 17
Posts: 216
Credit: 7,636
RAC: 0
Message 988 - Posted: 15 Nov 2017, 6:43:10 UTC

The work for this week is broken up into 4 different runs (16 to 19), each of which contains about a quarter of the 486 parameters that we're running through SoFiA.
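
As a quick check of the arithmetic: 486 does not divide evenly by 4, so "a quarter" can only be approximate. The sketch below assumes the parameters are spread as evenly as possible across the four runs; the post does not say how they were actually assigned, and `split_evenly` is a hypothetical helper, not project code.

```python
def split_evenly(total, parts):
    """Return chunk sizes that differ by at most one and sum to total."""
    base, extra = divmod(total, parts)
    # The first `extra` chunks each take one additional item.
    return [base + 1 if i < extra else base for i in range(parts)]


run_ids = [16, 17, 18, 19]          # run numbers from the post
sizes = split_evenly(486, len(run_ids))

for run_id, size in zip(run_ids, sizes):
    print(f"run {run_id}: {size} parameters")
```

Under that assumption, two runs would carry 122 parameters and two would carry 121.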

The aim here is to reduce the run time of each workunit, and to reduce the amount of data that needs to be recalculated whenever a workunit is marked invalid.

I'm still investigating the actual cause of these invalid workunits, but I haven't yet been able to reproduce the errors locally on my own machines. In the meantime, I hope this quick fix eases the load a little.
Yavanius
Volunteer moderator
Joined: 12 Feb 17
Posts: 121
Credit: 163,211
RAC: 2,542
Message 992 - Posted: 17 Nov 2017, 2:25:17 UTC - in response to Message 988.  
Last modified: 19 Nov 2017, 22:59:45 UTC

Noticing a number of invalids, although it seems most were from WUs before today...

I saw one WU that had to go through 5 hosts to get validated. Of those, only 2 returned valid results...
G_UK
Joined: 7 May 17
Posts: 20
Credit: 70,696
RAC: 1,218
Message 993 - Posted: 17 Nov 2017, 5:04:48 UTC
Last modified: 17 Nov 2017, 5:12:44 UTC

No invalids or Errors from this latest batch so far.

114 Valid
19 Inconclusives (+14 from the previous batch)

12 Pending (+11 from previous batch)
14 In-Progress

All in all, it's looking better. Earlier tonight I did a full restart of my machines as part of maintenance, so I will keep an eye out for invalids during that timeframe.

Edit: I cant count
Edit2: I still cant count
Gridcoin: Rx5iQUC9fdZkYuxrjW6ySV6Jfttsw5Ub2L
Bitshares: g-uk https://wallet.bitshares.org/?r=g-uk
Ethereum: 0x734E41c433DE29383957A80dc57B8D025dd326b5
Yavanius
Volunteer moderator
Joined: 12 Feb 17
Posts: 121
Credit: 163,211
RAC: 2,542
Message 997 - Posted: 17 Nov 2017, 20:58:51 UTC - in response to Message 993.  


> Edit: I cant count
> Edit2: I still cant count


It's okay. I'm sure there's an app for that. ;)
Yavanius
Volunteer moderator
Joined: 12 Feb 17
Posts: 121
Credit: 163,211
RAC: 2,542
Message 998 - Posted: 17 Nov 2017, 21:44:35 UTC - in response to Message 993.  

> No invalids or Errors from this latest batch so far.


Same here, but still quite a number of Inconclusives. I'm seeing a few on every page as I scroll through the tasks, although it looks like the older ones are slowly being cleared out. I'm not even going to try to count... I've got 20 pages of tasks just for today and the 16th, with the WUs chopped down in size.
SandJ

Joined: 29 Oct 17
Posts: 10
Credit: 10,173
RAC: 61
Message 999 - Posted: 18 Nov 2017, 12:59:19 UTC - in response to Message 997.  


>> Edit: I cant count
>> Edit2: I still cant count
>
> It's okay. I'm sure there's an app for that. ;)

Or even, an ap-ostrophe! ;-)
Yavanius
Volunteer moderator
Joined: 12 Feb 17
Posts: 121
Credit: 163,211
RAC: 2,542
Message 1002 - Posted: 19 Nov 2017, 3:22:40 UTC

Had one WU error out after 13 seconds with a disk space limit exceeded message:

11/18/2017 19:12:08 | duchamp | Aborting task sofia_19_askap_cube_6_41_21_1: exceeded disk limit: 2911.98MB > 2560.00MB

https://sourcefinder.theskynet.org/duchamp/result.php?resultid=382136

Kind of odd that it displays the limit exceeded message and then it looks like the task is starting afterward.
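
For anyone comparing several of these aborts, here is a minimal sketch that pulls the numbers out of a log line shaped like the one above and reports how far over the limit the task went. The regular expression is an assumption based on this single line, and `parse_disk_limit` is a hypothetical helper, not part of BOINC.

```python
import re

# The log line quoted in this post.
LOG_LINE = (
    "11/18/2017 19:12:08 | duchamp | Aborting task "
    "sofia_19_askap_cube_6_41_21_1: exceeded disk limit: "
    "2911.98MB > 2560.00MB"
)

# Pattern inferred from the single line above; other client versions may differ.
PATTERN = re.compile(
    r"Aborting task (?P<task>\S+): exceeded disk limit: "
    r"(?P<used>[\d.]+)MB > (?P<limit>[\d.]+)MB"
)


def parse_disk_limit(line):
    """Return (task, used_mb, limit_mb, over_mb), or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    used = float(match.group("used"))
    limit = float(match.group("limit"))
    return match.group("task"), used, limit, used - limit


task, used_mb, limit_mb, over_mb = parse_disk_limit(LOG_LINE)
print(f"{task}: {over_mb:.2f} MB over the {limit_mb:.0f} MB limit")
```

The 2560 MB limit is 2.5 x 1024 MB, and the task was about 352 MB past it when it was aborted.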
G_UK
Joined: 7 May 17
Posts: 20
Credit: 70,696
RAC: 1,218
Message 1003 - Posted: 19 Nov 2017, 5:29:34 UTC
Last modified: 19 Nov 2017, 5:43:03 UTC

> Or even, an ap-ostrophe! ;-)


I'm an Engineer grammar isn't our strong suit, plus Alcohol doesn't help.

-------------

Two invalid so far this batch and over 60 inconclusive. No apparent correlation with either closing (removing from memory) or suspending (which I assume keeps them in memory) running apps. One machine that has been running non-stop since the last maintenance cycle has the same results as my gaming machine, which I have been suspending each evening to, well, game. Full shutdowns during the last maintenance didn't produce a greater number of inconclusive tasks.

For completeness, both machines running Sourcefinder:
Windows 10 64, Intel i7 6700HQ, 16 GB RAM, VBox 5.1.30 with Ext pack
Windows 10 64, Intel i7 920, 12 GB RAM, VBox 5.1.30 with Ext pack
Yavanius
Volunteer moderator
Joined: 12 Feb 17
Posts: 121
Credit: 163,211
RAC: 2,542
Message 1004 - Posted: 19 Nov 2017, 22:59:09 UTC - in response to Message 1003.  

>> Or even, an ap-ostrophe! ;-)
>
> I'm an Engineer grammar isn't our strong suit, plus Alcohol doesn't help.


Remember, it was the doctor who brought the alcohol. ;) Although I'm pretty sure Scotty had his own personal stash.

Thanks for the info on the suspended/not suspended. Maybe it was just coincidence on my systems or just BOINC's quirky nature.
Sam
Joined: 9 Feb 17
Posts: 216
Credit: 7,636
RAC: 0
Message 1005 - Posted: 20 Nov 2017, 0:01:27 UTC

So has chopping the WUs up into smaller pieces been better or worse overall?
I'm afraid that, because there were so many invalids from the last run, they've been polluting this run, and it's been hard to tell if anything is actually better.
LumenDan

Joined: 9 Feb 17
Posts: 88
Credit: 164,599
RAC: 610
Message 1006 - Posted: 20 Nov 2017, 8:33:08 UTC - in response to Message 1005.  
Last modified: 20 Nov 2017, 8:41:58 UTC

> So has chopping the WUs up into smaller pieces been better or worse overall?
> I'm afraid that, because there were so many invalids from the last run, they've been polluting this run, and it's been hard to tell if anything is actually better.

I have had some invalid and inconclusive units from batches 16, 17 and 18, at a rate of about 11%. I haven't had any bad units from batch 19 yet.
Have you considered adjusting the virtual machine settings to see if different configurations have an effect on SoFiA's stability?
gomeyer

Joined: 29 Oct 17
Posts: 5
Credit: 27,898
RAC: 186
Message 1007 - Posted: 20 Nov 2017, 10:46:14 UTC - in response to Message 1005.  

> So has chopping the WUs up into smaller pieces been better or worse overall?

I don't have any numbers, but my feeling is that there is little if any difference.
G_UK
Joined: 7 May 17
Posts: 20
Credit: 70,696
RAC: 1,218
Message 1008 - Posted: 20 Nov 2017, 23:27:35 UTC - in response to Message 1005.  
Last modified: 20 Nov 2017, 23:34:34 UTC

For me, I've had the following with this batch:
467(ish) Valid
22 Invalid
= 4.71% failure rate, which is much improved (it seems like more, though, as there are so many more workunits now)
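
For reference, the 4.71% figure matches invalid divided by valid. A short sketch of both ways of expressing the rate, using only the two counts above (the variable names are just for illustration):

```python
valid = 467    # "467(ish) Valid" from the post
invalid = 22   # "22 Invalid" from the post

# Invalid tasks relative to valid ones (reproduces the figure in the post).
ratio_to_valid = invalid / valid * 100

# Invalid tasks as a share of all decided tasks (valid + invalid).
share_of_decided = invalid / (valid + invalid) * 100

print(f"invalid / valid:             {ratio_to_valid:.2f}%")
print(f"invalid / (valid + invalid): {share_of_decided:.2f}%")
```

Measured as a share of all decided tasks, the rate comes out slightly lower, at about 4.50%. Either way it is well below the roughly 11% reported earlier in the thread for batches 16 to 18.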

Still have the following outstanding from this batch:
11 Inconclusive
9 Pending
1 In-progress

I've also validated a handful of workunits with multiple failures from previous batches without generating any more invalids.

Tip for anyone counting: you can quickly tell this batch from previous ones by the CPU time, as there is a marked difference between them. You still have to count how many tasks have longer times and subtract that from the number of pages x 20 after filtering the list.
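
The counting tip can be sketched as a small function. This assumes 20 tasks per page, as in the tip, and adds an adjustment for a partly filled last page; the function name and the example numbers are hypothetical, not taken from anyone's task list.

```python
TASKS_PER_PAGE = 20  # page size used in the tip above


def count_current_batch(pages, older_tasks, last_page_tasks=TASKS_PER_PAGE):
    """Estimate how many listed tasks belong to the current batch.

    pages           -- number of pages in the filtered task list
    older_tasks     -- tasks with the longer CPU times (previous batches)
    last_page_tasks -- tasks on the final page, if it is not full
    """
    total_listed = (pages - 1) * TASKS_PER_PAGE + last_page_tasks
    return total_listed - older_tasks


# Hypothetical example: 5 pages, 12 tasks on the last page, 7 older tasks.
current = count_current_batch(pages=5, older_tasks=7, last_page_tasks=12)
print(current)
```

If the last page happens to be full, leaving out `last_page_tasks` gives the plain "pages x 20 minus the longer ones" figure.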
Sam
Joined: 9 Feb 17
Posts: 216
Credit: 7,636
RAC: 0
Message 1009 - Posted: 22 Nov 2017, 0:02:32 UTC - in response to Message 1008.  

The runs from the latest batch should be about 1/4 the length of those in the previous batch.

According to the project stats, we've run through all of the work for this batch and the last batch (with about 400 WUs remaining).
Having 14k workunits sitting in line for the assimilator is bad, though. I think I'll see if I can improve its speed.
Yavanius
Volunteer moderator
Joined: 12 Feb 17
Posts: 121
Credit: 163,211
RAC: 2,542
Message 1010 - Posted: 22 Nov 2017, 2:59:21 UTC - in response to Message 1009.  

Well, you suddenly increased the number of WUs coming in by roughly 4x... so you're still set up for a slower influx. On the bright side, it gives you an excuse to go tweaking things now. :)
Yavanius
Volunteer moderator
Joined: 12 Feb 17
Posts: 121
Credit: 163,211
RAC: 2,542
Message 1011 - Posted: 22 Nov 2017, 3:01:33 UTC - in response to Message 1008.  
Last modified: 22 Nov 2017, 3:01:55 UTC

> Tip for anyone counting: you can quickly tell this batch from previous ones by the CPU time, as there is a marked difference between them. You still have to count how many tasks have longer times and subtract that from the number of pages x 20 after filtering the list.


Or just simply look at the send dates. Granted, you might have some older WUs that needed additional wingmen in there, but probably not a significant number.
Sam
Joined: 9 Feb 17
Posts: 216
Credit: 7,636
RAC: 0
Message 1012 - Posted: 22 Nov 2017, 3:06:47 UTC - in response to Message 1011.  

The issue is that the assimilator has to upload a lot of data to our Amazon S3 bucket, and it's REALLY slow at doing that for each workunit.
I'm spending some time right now improving its speed. Should be able to get at least a doubling in assimilator speed, I think.
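
The post doesn't say how the assimilator will be sped up. One common approach when each upload mostly waits on the network is to overlap several uploads instead of doing them one at a time. The sketch below only illustrates that general idea: `upload_result` is a hypothetical stand-in that sleeps instead of talking to S3, and none of these names come from the project's code.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def upload_result(workunit_name):
    """Hypothetical stand-in for one slow upload to remote storage."""
    time.sleep(0.05)  # simulate per-upload network latency
    return workunit_name


def assimilate_serial(workunits):
    """Upload one workunit at a time, waiting for each to finish."""
    return [upload_result(wu) for wu in workunits]


def assimilate_parallel(workunits, max_workers=8):
    """Overlap uploads with a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(upload_result, workunits))


workunits = [f"wu_{i}" for i in range(16)]  # made-up workunit names

start = time.perf_counter()
serial_results = assimilate_serial(workunits)
serial_seconds = time.perf_counter() - start

start = time.perf_counter()
parallel_results = assimilate_parallel(workunits)
parallel_seconds = time.perf_counter() - start

print(f"serial: {serial_seconds:.2f}s, parallel: {parallel_seconds:.2f}s")
```

With simulated latency the parallel version finishes several times faster, but the real gain would depend on bandwidth, file sizes and any rate limits on the storage side.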



©2017 ICRAR